=Paper= {{Paper |id=Vol-190/paper-3 |storemode=property |title=Propagating Trust and Distrust to Demote Web Spam |pdfUrl=https://ceur-ws.org/Vol-190/paper03.pdf |volume=Vol-190 |authors=Baoning Wu,Vinay Goel and Brian Davison |dblpUrl=https://dblp.org/rec/conf/mtw/WuGD06 }} ==Propagating Trust and Distrust to Demote Web Spam== https://ceur-ws.org/Vol-190/paper03.pdf
      Propagating Trust and Distrust to Demote Web Spam

                                 Baoning Wu           Vinay Goel      Brian D. Davison
                                     Department of Computer Science & Engineering
                                                   Lehigh University
                                              Bethlehem, PA 18015 USA
                                     {baw4,vig204,davison}@cse.lehigh.edu



ABSTRACT                                                            Trust can be used to combat Web spam. Gyöngyi et
Web spamming describes behavior that attempts to deceive         al. [13] present the TrustRank algorithm based on this idea.
search engine’s ranking algorithms. TrustRank is a recent        This technique assumes that a link between two pages on
algorithm that can combat web spam by propagating trust          the Web signifies trust between them; i.e., a link from page
among web pages. However, TrustRank propagates trust             A to page B is a conveyance of trust from page A to page
among web pages based on the number of outgoing links,           B. In this technique, human experts, initially, select a list of
which is also how PageRank propagates authority scores           seed sites that are well-known and trustworthy on the Web.
among Web pages. This type of propagation may be suited          Each of these seed sites is assigned an initial trust score. A
for propagating authority, but it is not optimal for calculat-   biased PageRank [23] algorithm is then used to propagate
ing trust scores for demoting spam sites.                        these trust scores to the descendants of these sites. The au-
  In this paper, we propose several alternative methods to       thors observed that on applying this technique, good sites
propagate trust on the web. With experiments on a real web       had relatively high trust scores, while spam sites had low
data set, we show that these methods can greatly decrease        trust scores.
the number of web spam sites within the top portion of the          TrustRank shows that the idea of propagating trust from
trust ranking. In addition, we investigate the possibility of    a set of highly trusted seed sites helps a great deal in the de-
propagating distrust among web pages. Experiments show           motion of Web spam. But TrustRank is just one implemen-
that combining trust and distrust values can demote more         tation of this idea. This approach makes certain assump-
spam sites than the sole use of trust values.                    tions with regard to how trust is propagated from a parent
                                                                 page to a child page. For example, the authors claim that
                                                                 the possibility of a page pointing to a spam page increases
Categories and Subject Descriptors                               with the number of links the pointing page has. Because of
H.3.3 [Information Storage and Retrieval]: Information           this, they proposed the idea that the trust score of a parent
Search and Retrieval                                             page be equally split amongst its children pages.
                                                                    This assumption is open to argument. Why should two
                                                                 equally trusted pages propagate different trust scores to
General Terms                                                    their children just because one made more recommendations
Algorithms, Performance                                          than the other? Also, with respect to the accumulation of
                                                                 trust scores from multiple parents, TrustRank puts forth
Keywords                                                         just one solution, that of simple summation. Clearly, there
                                                                 are other alternatives.
Web spam, Trust, Distrust, PageRank, TrustRank
                                                                    A natural extension of the idea of the conveyance of trust
                                                                 between links is that of the conveyance of distrust. Here, dis-
1.   INTRODUCTION                                                trust has a different meaning to that in the context of social
  In today’s Web, a link between two pages can be consid-        networks. In social networks, distrust between two nodes A
ered to be an implicit conveyance of trust from the source       and B usually means that A shows distrust explicitly to B.
page to the target page. In this case, trust implies that        In contrast, in our system, distrust is a penalty awarded to
the author of the source page believes that the target page      the source page for linking to an untrustworthy page. Hence,
provides some content value.                                     this distrust is an indication that we don’t trust some web
  With the increasing commercial interest of being ranked        pages, not an indication that one page doesn’t trust another
high in search engine results, content providers resort to       page on the web. Actually, the trust score of a page can also
techniques that manipulate these results. This behavior is       be interpreted as how much we trust this page.
usually termed Web spam, or search engine spam. Many                In general, spam pages can be considered to be one type
kinds of spam have been discovered [24, 12, 5]. Henzinger        of untrustworthy pages. To elaborate on this idea, consider
et al. [15] mention that Web spam is one of the major chal-      that a page links to another page and hence according to the
lenges faced by search engines. There is no universal method     above definition of trust, this page expresses trust towards
that can detect all kinds of spam at the same time.              the target page. But if this target page is known to be a
                                                                 spam page, then clearly the trust judgment of the source
                                                                 page is not valid. The source page needs to be penalized
for trusting an untrustworthy page. It is likely that the          is that a page will get high BadRank value if it points to
source page itself is a spam page, or is a page that we believe    some pages with high BadRank value. This idea is similar
should not be ranked highly for its negligence in linking to       in spirit to our mechanism of propagating distrust in this
an untrustworthy page.                                             paper.
   In this paper, we explore the different issues present in the
problem of propagating trust on the Web. We also study the         3. RELATED WORK
application of propagating distrust on the Web. Addition-
                                                                      While the idea of a focused or custom PageRank vector
ally, we present techniques to combine trust and distrust
                                                                   has existed from the beginning [23], Haveliwala [14] was
scores to improve the overall performance in demoting Web
                                                                   the first to propose the idea of bringing topical information
spam.
                                                                   into PageRank calculation. In his technique, pages listed
   The rest of this paper is organized as follows: the back-
                                                                   in DMOZ [22] are used as the seed set to calculate the bi-
ground and related work will be introduced in Section 2
                                                                   ased PageRank values for each of the top categories. Then a
and Section 3 respectively. The motivation of this work will
                                                                   similarity value of a query to each of these categories is cal-
be introduced in Section 4. The details of our technique
                                                                   culated. A unified score is then calculated for each page con-
are given in Section 5. The experiments and results will be
                                                                   taining the given query term(s). Finally, pages are ranked
shown in Section 7. We finish with discussion and conclusion
                                                                   by this unified score. Experiments show that Topic-sensitive
in Sections 8 and 9.
                                                                   PageRank has better performance than PageRank in gener-
                                                                   ating better response lists to a given query.
2.   BACKGROUND                                                       Jeh and Widom [17] specialize the global notion of impor-
                                                                   tance that PageRank provides to create personalized views
2.1 Matrix Definition                                              of importance by introducing the idea of preference sets.
   The web can be represented by a directed graph, given           The rankings of results can then be biased according to this
web pages as the nodes and hyperlinks among web pages as           personalized notion. For this, they used the biased PageR-
the directed links among the nodes. The adjacency matrix           ank formula.
M of the web graph is: M [i, j] equals 1 if there is a hyperlink      Several researchers have done some work to combat dif-
from page i to page j, or 0 otherwise. Suppose we use I(i) to      ferent kind of Web spam. Fetterly et al. propose using sta-
represent the in-degree of node i and O(i) as the out-degree       tistical analysis to detect spam [7]. Acharya et al. [2] first
of node i, the definition of the transition matrix T is:           publicly propose using historical data to identify link spam
                                                                   pages. Wu and Davison [26] proposed using the intersection
                                                                   of the incoming and outgoing link sets plus a propagation
                    T [i, j] = M [j, i]/O(j)                (1)
                                                                   step to detect link farms. Mishne et al. [20] used a language
and the definition of the reverse transition matrix R is:          model to detect comment spam. Drost and Scheffer [6] pro-
                                                                   posed using a machine learning method to detect link spam.
                    R[i, j] = M [i, j]/I(j)                 (2)    Recently, Fetterly et al. [8] describe methods to detect a spe-
2.2 TrustRank and BadRank                                          cial kind of spam that provides pages by stitching together
                                                                   sentences from a repository.
   Gyöngyi et al. [13] introduce TrustRank. It is based on           Benczur et al. proposed SpamRank in [4]. For each page,
the idea that good sites seldom point to spam sites and            they check the PageRank distribution of all its incoming
people trust these good sites. This trust can be propagated        links. If the distribution doesn’t follow a normal pattern,
through the link structure on the Web. So, a list of highly        the page will be penalized and used as seed page. They also
trustworthy sites are selected to form the seed set and each       adopt the idea that spam values are propagated backward
of these sites is assigned a non-zero initial trust score, while   and finally spam pages will have high SpamRank values.
all the other sites on the Web have initial values of 0. Then a    Compared to SpamRank, we use labeled spam pages as our
biased PageRank algorithm is used to propagate these initial       seed set.
trust scores to their outgoing sites. After convergence, good         In prior work, we [27] pointed out that TrustRank has
sites will get a decent trust score, while spam sites are likely   a bias towards better represented communities in the seed
to get lower trust scores. The formula of TrustRank is:            set. In order to neutralize this bias, we proposed “Topical
                 t = (1 − α) × T × t + α × s                (3)    TrustRank”, which uses topics to partition the seed set and
                                                                   different mechanisms to combine trust scores from each par-
where t is the TrustRank score vector, α is the jump prob-         tition. We showed that this algorithm can perform better
ability, T is the transition matrix and s is the normalized        than TrustRank in reducing the number of highly ranked
trust score vector for the seed set. Before calculation, t is      spam sites. Compared with that paper, we do not consider
initialized with the value of s. Gyöngyi et al. iterated the      partitions for the seed set here. Instead, we show that dif-
above equation 20 times with α set to 0.15.                        ferent mechanisms for propagating trust can also help to
   In many SEO discussion boards, participants discuss the         demote more top ranked spam sites. The methods proposed
latest ranking and spam-finding techniques employed by             in this paper can generate better performance than Topical
commercial search engines. One approach, called Bad-               TrustRank.
Rank1 , is believed by some to be used by a commercial                Guha et al. [11] study how to propagate trust scores
engine to combat link farms.2 BadRank is based on prop-            among a connected network of people. Different propagation
agating negative value among pages. The idea of BadRank            schemes for both trust score and distrust score are studied
1
 One description of BadRank can be found at [1].                   based on a network from a real social community website.
2                                                                  Compared with their ideas, our definition of distrust is not
 See,  for example http://www.webmasterworld.com
/forum3/20281-22-15.htm.                                           exactly same. Their goal is to predict whether two people
will show trust (or distrust) to the other, but our goal is to
use trust and distrust to demote Web spam, especially top
ranked spam pages or sites.
   Massa and Hayes [19] review several current proposals for
extending the link mechanism to incorporate extra semantic
information, primarily those that allow the authors of a web
page to describe their opinion on pages they link to. They
argue that any change to the hyperlink facility must be easily
understood by the ordinary users of the Web, but the more        Figure 1: A link on the Web can propagate both
expressive linking structure would produce a richer semantic     trust and distrust.
network from which more precise information can be mined.
They used a real world data set from Epinions.com as a
proxy for the Web with the analogy that web pages are            Page B, then trust is propagated from Page A to Page B,
Epinions users and links are trust and distrust statements.      while distrust is propagated from Page B to Page A.
They show that this additional link information would allow        We explore different techniques for the handling of prop-
the PageRank algorithm to identify highly trusted web sites.     agation of trust and distrust from the respective seed sets
   Ziegler and Lausen [28] introduce the Appleseed algo-         to other pages on the Web.
rithm, a proposal for local group trust computation. The
basic intuition of the approach is motivated by spreading        5. ALGORITHM DETAILS
activation strategies. The idea of spreading activation is the
propagation of energy in a network. Also, the edges between        In this section, we present details of our ideas on propa-
the nodes are weighted based on the type of the edges. This      gating trust and distrust among web pages.
idea of energy flow is tailored for trust propagation. In con-
trast, our algorithm doesn’t consider a weighted graph.
                                                                 5.1 Propagating Trust
   Gray et al. [9] proposed a trust-based security framework        TrustRank propagates trust among web pages in the same
for ad hoc networks. The trust value among two nodes con-        manner as the PageRank algorithm propagates authority
nected by a path is the average of the weighted sum of trust     among web pages. The basic idea is that during each iter-
values of all nodes in the path. No experimental results are     ation, a parent’s trust score is divided by the number of its
shown.                                                           outgoing links and each of its children gets an equal share.
                                                                 Then a child’s overall trust score is the sum of the shares
                                                                 from all its parents.
4.   MOTIVATION                                                     Two key steps in the technique described above may be
   The original TrustRank paper proposed that trust should       explored. One is, for each parent, how to divide its score
be reduced as we move further and further away from the          amongst its children; we name this the “splitting” step. The
seed set of trusted pages. To achieve this attenuation of        other is, for each child, how to calculate the overall scores
trust, the authors propose two techniques, trust dampening       given the shares from all its parents; we name this the “ac-
and trust splitting. With trust dampening, a page gets the       cumulation” step.
trust score of its parent page dampened by a factor less than       For the splitting step, we study three choices:
1. With trust splitting, a parent’s trust score is equally
divided amongst its children. A child’s overall trust score is      • Equal Splitting: a node i with O(i) outgoing links
given by the sum of the shares of the trust scores obtained           and trust score T R(i) will give d × TO(i)
                                                                                                            R(i)
                                                                                                                 to each child.
from its parents.                                                     d is a constant with 0 < d < 1;
   In the case of trust splitting, we raise a question: Given
two equally trusted friends, why should the recommenda-             • Constant Splitting: a node i with trust score T R(i)
tions made by one friend be weighted less than the other,             will give d × T R(i) to each child;
simply because the first made more recommendations? A
similar argument has been made by Guha [10].                        • Logarithm Splitting: a node i with O(i) outgoing
                                                                                                                    T R(i)
   It is observed that a spam page often points to other spam         links and trust score T R(i) will give d × log(1+O(i)) to
pages for the purposes of boosting their PageRank value               each child.
and manipulating search engine results [26]. Motivated by
the idea of trust propagation, we believe that propagating          We term d to be the decay factor, which determines how
distrust given a labeled spam seed set, will help to penalize    much of the parents’ score is propagated to its children. In
other spam pages.                                                fact, if d equals 1, then the above “Equal Splitting” is the
   Hence, given a set of labeled spam seed set, we can prop-     same as the method used in TrustRank. As discussed in
agate distrust from this set to the pages that point to mem-     the Section 4, why should equally trusted pages propagate
bers of this set. The idea is that a page pointing to a spam     different trust scores just because they have different number
page is likely to be spam itself. But sometimes, good pages      of children? With “Constant Splitting”, each parent will
may unintentionally point to spam pages. In this case, these     give a constant portion of its trust value to all of its children
pages are penalized for not being careful with regard to cre-    irrespective of the number of its children. Thus for a child,
ating or maintaining links (as suggested by [3]).                if two of its parents have identical trust values but different
   In doing so, each page on the Web is assigned two scores,     number of children, then the child will get the same value
a trust score and a distrust score. In the combined model,       from both of these parents. The third choice, “Logarithm
a link on the Web can then propagate these two scores. As        Splitting” does not eliminate the effect of the number of
shown in Figure 1, suppose there is a link from Page A to        children that a page has but can decrease it.
  Since “Equal Splitting” is the choice already being em-          2 and r is the normalized distrust score vector for the dis-
ployed in TrustRank, we will focus on “Constant Splitting”         trusted seed set. Before calculation, n is initialized with the
and “Logarithm Splitting” in our experiments.                      value of r.
  For the accumulation step, we study three choices.                 However, as discussed in Section 5.1, the propagation
                                                                   mechanism of TrustRank may not be optimal to propagate
   • Simple Summation: Sum the trust values from each              trust or distrust for the purpose of demoting spam pages.
     parent.                                                       We propose that the same choices to propagate trust, dis-
   • Maximum Share: Use the maximum of the trust                   cussed in Section 5.1, can be taken to propagate distrust.
     values sent by the parents.                                     Suppose we use DIS T R(i) to represent the distrust score
                                                                   for node i. For the splitting step, we have three choices:
   • Maximum Parent: Sum the trust values in such a
     way as to never exceed the trust score of the most-              • Equal Splitting: a node i with I(i) incoming links
     trusted parent.                                                    and DIS T R(i) will give dD × DISI(i)
                                                                                                          T R(i)
                                                                                                                 to each par-
                                                                        ent. where 0 < dD < 1;
  The first choice is the same as in PageRank and
TrustRank; using the sum of trust scores from all parents             • Constant Splitting: a node i with DIS T R(i) will
as the child’s trust score. For “Maximum Share”, the max-               give dD × DIS T R(i) to each parent;
imum value among the trust values inherited from all the
parents is used as the child’s trust score. For “Maximum              • Logarithm Splitting: a node i with I(i) incoming
                                                                                                            DIS T R(i)
Parent”, first the sum of trust values from each parent is              links and DIS T R(i) will give dD × log(1+I(i)) to each
calculated and this sum is compared with the largest trust              parent.
score among each of its parents, the smaller of these two
values is used as the child’s trust score.                            The “Equal Splitting” choice is quite similar to that in
  By using the above choices, the equation for calculating         the case of trust propagation in TrustRank. Intuitively, this
trust score is different from Equation 3. For example, if          kind of splitting may raise problems when the purpose of
using “Constant Splitting” and “Simple Summation”, the             propagating distrust is to demote spam. For a simple ex-
equation will become:                                              ample, by “Equal Splitting”, a spam site with more parents
                                                                   will propagate smaller distrust to its parents, while spam
             t = (1 − α) × d × M T × t + α × s              (4)    sites with fewer parents will propagate bigger distrust to its
where t is the trust score vector, α is the jump probability, d    parents. Obviously, this policy supports popular spam sites
is the constant discussed in the above splitting choices, M is     and this is clearly not desirable for the purpose of demoting
the web matrix shown in Section 2.1 and s is the normalized        spam. In comparison, “Constant Splitting” and “Logarithm
trust score vector for the seed set.                               Splitting” present better choices.
                                                                      For the accumulation step, we also have three choices:
5.2 Propagating Distrust                                              • Simple Summation: Sum the distrust values from
   The trust score of a page is an indication of how trustwor-          each child.
thy the page is on the Web. In the case of web spam, the
trust score can be seen as a measure of the likelihood that           • Maximum Share: Use the maximum of the distrust
a page is not a spam page.                                              values sent by the children;
   Similarly, we introduce the concept of distrust to penal-
                                                                      • Maximum Parent: Sum the distrust values in such
ize the pages that point to untrustworthy pages. Now, it is
                                                                        a way as to never exceed the distrust score of the most-
possible that pages unintentionally point to spam pages. In
                                                                        distrusted child.
these cases, we argue that the (otherwise good) page should
be penalized to some extent for not being careful in its link-       Different choices will employ different equations during
ing behavior.                                                      the calculation. For example, if using “Constant Splitting”
   Distrust propagation makes sense when spam sites are            and “Simple Summation”, the equation of calculating dis-
used as the distrusted seed set and distrust is propagated         trust score is:
from a child to its parent. So, based on this idea, one link can
                                                                                n = (1 − α) × dD × M × n + α × r              (6)
represent two propagation processes, i.e., the trust score is
propagated from the parent to the children while the distrust      where n is the distrust score vector, α is the jump prob-
score is propagated from the children to the parent.               ability, d is the constant discussed in the above splitting
   In this technique, some known spam pages are selected           choices, M is the web matrix shown in Section 2.1 and r is
as the distrusted seeds and assigned some initial distrust         the normalized distrust score vector for the distrusted seed
scores. During each iteration, the distrust score is propa-        set.
gated from children pages to parent pages iteratively. After
convergence, a higher distrust score indicates that this page      5.3 Combining Trust and Distrust
is more likely to be a spam page.                                     On propagating trust and distrust to the pages on the
   A direct method of calculating distrust score for each page     web, each page will be assigned two scores, a trust score
is to follow the same idea as TrustRank. The calculation can       and a distrust score. Then comes the question of combining
be represented by Equation 5.                                      them to generate a unified ranking of pages that is indicative
                                                                   of their trustworthiness.
                n = (1 − α) × R × n + α × r                 (5)
                                                                      Our goal of propagating trust and distrust is to demote
where n is the distrust score vector, α is the jump proba-         spam sites in the ranking. Since the trust score is an indi-
bility, R is the reverse transition matrix shown in Equation       cation of how unlikely it is that the page is a spam page,
while the distrust score is an indication of how likely it is
that the page is a spam page, a direct solution is to simply
calculate the difference of these two scores and use this value
to represent the overall trustworthiness of the Web page.
  Additionally, we may apply several methods for the com-
bination. For example, we may give different weights when
calculating the sum. Suppose we use T otal(i) to represent
the difference of trust and distrust score for page i. Then
we can apply the following formula:

          T otal(i) = η × T R(i) − β × DIS T R(i)          (7)

where η and β (0 < η < 1, 0 < β < 1) are two coefficients
to give different weights to trust and distrust scores in this    Figure 2: Number of spam sites within each bucket
formula.                                                          by PageRank and TrustRank.

6.   DATA SET
   The data set used in our experiments is courtesy of            TrustRank is shown in Figure 2. It is clear that TrustRank
search.ch search engine [25]. It is a 2003 crawl of pages that    is good at demoting spam sites compared to PageRank.
are mostly from the Switzerland domain. There are about              In this paper, we use the number of spam sites within the
20M pages within this data set and around 350K sites with         top 10 buckets as the metric for measuring the performance
the “.ch” domain. Since we were also provided with 3, 589 la-     of algorithms. This choice of choosing the top 10 buckets was
beled sites and domains applying different spam techniques,       arbitrary as in the case of [27]. The smaller the number of
we used the site graph for testing the ideas we propose in        spam sites in the top 10 buckets, the better the performance
this paper.                                                       of the algorithm in demoting spam sites from the top ranking
   In order to generate a trusted seed set, we extract all the    positions.
URLs listed within the search.ch topic directory [25] of 20          The results of this metric for the PageRank and
different topics, which is similar to the DMOZ directory but      TrustRank algorithms are shown in Table 1. These results
only lists pages primarily within the Switzerland domain.         will be used as the baseline results. We can see that PageR-
Since we use the site graph in our calculation and the topic      ank ranks 90 spam sites within the top ten buckets, while
directory listed only pages, we used a simple transfer policy:    TrustRank ranks only 58 spam sites.
if a site had a page listed in a certain topic directory, we
put the site into a trusted seed set. In doing so, we marked      7.1 Different Jump Probabilities
20, 005 unique sites to form the seed set.                           In TrustRank, the jump probability α in Equation 3 is
   For the generation of a distrusted seed set, we use the        usually assigned a value of 0.15. We measure the perfor-
labeled spam list which contains 3, 589 sites or domains. In      mance of TrustRank with different values of this jump fac-
our experiments, we use only a portion of this list as the        tor.
distrusted seed set with the rest being used to evaluate the         Since we use all the URLs listed in dir.search.ch as the
performance.                                                      trusted seed set, it is quite possible that some spam sites
                                                                  get included in this set too. On checking, we find that 35
                                                                  labeled spam sites are within the trusted seed set. It is
7.   EXPERIMENTS                                                  worthwhile to drop these spam sites from the seed set. We
   We test all the ideas we propose in Section 5 by using         run TrustRank again with different jump probabilities after
the search.ch data set. Since the goal of this paper is to        dropping these 35 labeled spam sites from the seed set.
investigate how different mechanisms of propagating trust            The results with both the original seed set and the cleaned
and distrust can help to demote top ranking spam sites, we        seed set are shown in Figure 3. We observe that larger jump
will focus on the ranking positions of the labeled 3, 589 spam    probabilities decrease the number of spam sites from top
sites.                                                            ranking positions. Since a larger jump probability means
   We first calculate the PageRank value for each site based      that smaller trust values are propagated from a parent to its
on the search.ch site graph. These sites are then ranked          children, the results show that for the purpose of demoting
in a descending order of their PageRank values. Based on          spam sites, in TrustRank, a better approach is of relatively
this ranking, we divide these sites among 20 buckets, with        little trust propagation. We also observe that the dropping
each bucket containing sites with the sum of their PageRank       of spam sites from the seed set results in fewer spam sites
values equal to 1/20th of the sum of the PageRank values          within the top ten buckets.
of all sites.
   We then calculate the TrustRank score for each site based
on the site graph, to generate a ranking of sites sorted in
the descending order of these scores. As in the case of the
                                                                               Algorithm     No. of Spam sites
TrustRank paper [13], we iterated 20 times during this cal-                                  in top 10 buckets
culation. We then divide these sites among 20 buckets such                      PageRank                    90
that each TrustRank bucket has an identical number of sites                     TrustRank                   58
to the corresponding PageRank bucket. The distribution of
the 3,589 spam sites in the 20 buckets by PageRank and              Table 1: Baseline results for search.ch data set.
                          Algorithm                    Constant                         Logarithm
                                                       Splitting                         Splitting
                          d value           d=0.1    d=0.3 d=0.7      d=0.9   d=0.1    d=0.3 d=0.7      d=0.9
                      Simple summation        364      364     364      364     364      364     364      364
                       Maximum Share           34       34       34      34      13       12       20      18
                      Maximum Parent           27       32       33      33     372       27       29      32

Table 2: Results for the combination of different methods of propagating trust. Experiments are done with
different values for d. Only trust score is used in this table.


7.2 Different Trust Propagation Methods                               these choices. In Table 2, our best result is 12 spam sites in
   As introduced in Section 5, we explore two choices in the          the top ten buckets, which is a much greater improvement
splitting step: “Constant Splitting” (d × T R(i)) and “Loga-          when compared to the baseline results of 58 spam sites in
                          T R(i)
rithm Splitting” (d log(1+O(i))   ), while we have three choices      Table 1.
                                                                         It is worth mentioning that by introducing the above ideas
in the accumulation step: “Simple Summation”, “Maximum
                                                                      of splitting and accumulating trust, we notice, in some cases,
Share” and “Maximum Parent”.
                                                                      long ties in the trust scores. For example, the top several
   The number of different combinations of the above choices
                                                                      thousand of sites may have identical trust scores. This is
is six. For each combination we try using different values of
                                                                      different from the values by PageRank or TrustRank. We
d ranging from 0.1 to 0.9. The results of these six combina-
                                                                      think this tie is still reasonable as long as few spam sites
tions with different values of d are shown in Table 2.
                                                                      are in the ties close to the top. Since there are 3, 823 sites
   From the results in Table 2, we can tell that “Simple Sum-
                                                                      in the top ten buckets by PageRank, we consider the ties
mation” always generates the worst performance, which is
                                                                      that have rankings around this position still within top ten
worse than TrustRank and even PageRank. A lot of spam
                                                                      buckets, thus all the spam sites before or within this tie will
sites are raised in the ranking. Intuitively, this “Simple Sum-
                                                                      still be counted within top ten buckets.
mation” will boost the rankings of sites with multiple par-
                                                                         Actually, we find that for most cases, these ties can help
ents. In general, it is likely a spam site that has a large
                                                                      to demote more spam sites. But some small d may cause
number of incoming links will be able to accumulate a fairly
                                                                      a strong tie with more than 10, 000 sites and thus raise the
large value of trust. Hence, spam sites may be benefited by
                                                                      number of spam sites within top ten buckets. One example is
this “Simple Summation” method.
                                                                      that there are 372 spam sites within top ten buckets when
   We also observe that, in most cases, both “Maximum
                                                                      combining “Maximum Parent” and “Logarithm Splitting”
Share” and “Maximum Parent” methods generate much bet-
                                                                      with d set to 0.1.
ter performance than TrustRank and the “Simple Summa-
tion” method. With regard to the splitting methods, we
observe that in most cases, “Logarithm Splitting” performs            7.3 Introducing Distrust
better than “Constant Splitting”.                                        Trust can be propagated from trusted seed set to the chil-
   The results clearly demonstrate that for the purpose of            dren pages iteratively. Similarly, distrust can be propagated
demoting web spam, propagating trust based on the idea of             from a distrusted seed set to the parent pages iteratively.
“Equal Splitting” and “Simple Summation” which is used                While our distrusted seed set was provided to us, in general
by TrustRank, is not the optimal solution.                            a search engine will maintain a spamming blacklist, using
   Gyöngyi et al. [13] mentioned that there are different pos-       both manual and automatic methods (perhaps, e.g., [7, 8,
sibilities for splitting trust scores; the reason that they chose     26, 21]).
the method similar to PageRank is that only minor changes                In order to investigate whether introducing distrust can
are needed for calculating TrustRank by using existing effi-          help to improve the performance in demoting spam sites,
cient methods for computing PageRank [18]. We argue that              we randomly select a portion of labeled spam sites as the
if different choices of splitting and accumulating trust can          distrusted seed set and calculate distrust values for each
greatly demote spam sites, it is worthwhile to implement              site. The ranking positions of the remaining spam sites will
                                                                      be used to evaluate the performance.

                                                                      7.3.1 Basic Propagation of Distrust
                                                                        As described, there are several different choices of prop-
                                                                      agating distrust among web pages, we first use the method
                                                                      shown in Equation 5.
                                                                        We randomly select 200 spam sites from the 3, 589 labeled
                                                                      spam sites as the distrusted seed set to calculate distrust
                                                                      score. Then we calculate the sum of this distrust score and
                                                                      the trust score generated by TrustRank. By using the sum
                                                                      for ranking, we count the number of spam sites (m) in the
                                                                      top ten buckets as in the case of previous experiments.
                                                                        But we can not compare the above number m directly with
                                                                      the results shown in Table 1. The reason is that some top
                                                                      ranked spam sites may have been selected in the distrusted
Figure 3: Number of top ranked spam sites with                        seed and they will get demoted as an effect of their selection,
different jump probabilities for TrustRank.                           not as an effect of our algorithm. Thus, in order to be fair,
                   Algorithm                       Constant                               Logarithm
                                                   Splitting                               Splitting
                   dD value          dD =0.1   dD =0.3 dD =0.7     dD =0.9   dD =0.1   dD =0.3 dD =0.7      dD =0.9
               Simple Summation           53        53       55         55        57        53       53          53
                Maximum Share             53        53       53         53        59        53       52          52
                Maximum Parent            53        53       53         53        57        53       53          53

Table 3: Results for different methods of propagating distrust. The ranking is determined by the combination
of distrust score and TrustRank.


we need to count the number of spam sites (n) that are              each different combination, we can see how different choices
in the top ten buckets by TrustRank which are also in the           of propagating distrust can affect the overall performance
distrusted seed set. Only when the sum of m and n is smaller        and thus we can tell which choice is better for propagating
than 58, which is listed in Table 1, we can claim that the          distrust. For simplicity, we only choose 200 spam sites to
performance is better than that of TrustRank.                       generate the distrusted seed set once. Results of six different
   Also the random selection of distrusted seeds may still          combinations with different d values are shown in Table 3.
not be representative of the 3, 589 spam sites. In order to           From the results in Table 3, we can see that some choices
neutralize this bias, we repeated the above seed selection          can help to demote more spam sites than others. For exam-
five times for calculating distrust scores. Then we use the         ple, the combination of “Logarithm Splitting” and “Maxi-
average results as the final results for the distrusted seed set    mum Share” with d set to 0.7 or 0.9.
with 200 seeds. On average, there are 54 spam sites still in
the top ten buckets and 4 spam sites are in the distrusted          7.4 Combining Trust and Distrust Values
seed set. The sum of 54 and 4 equals the number of spam               In the above experiments, we use the sum of the trust and
sites, which is 58, in top ten TrustRank buckets; this shows        distrust values as the final value for ranking. As discussed
that using TrustRank’s mechanism (Equation 5) to prop-              in Section 5, we may use different weights to combine trust
agate distrust is not helpful in demoting top ranked spam           and distrust values.
sites.                                                                In practice, we did the following experiment to show how
   In order to verify whether introducing more distrusted           the combination of trust and distrust values can affect per-
seeds with this basic distrust propagation is useful, we gen-       formance.
erated distrusted seed sets of sizes ranging from 200 to 1, 000.
                                                                       • To calculate trust score, we select the choice that can
Similarly, for each seed set size, we repeated this generation
                                                                         generate best performance in Table 2, i.e., using “Max-
five times. The average results are shown in Table 4. The
                                                                         imum Share” for accumulation and “Logarithm Split-
results show that no matter how many seeds are selected
                                                                         ting” for splitting while with d set to 0.3.
for the distrusted seed set, the sum of the second element
and third element in Table 4 is always around 60. Since this           • To calculate distrust score, we select the choice that
sum is quite close to the 58 spam sites in Table 1, we believe           can generate best performance in Table 3, i.e., using
that using the same mechanism as TrustRank to propagate                  “Maximum Share” for accumulation and “Logarithm
distrust can not help to demote top ranked spam sites.                   Splitting” for splitting with dD set to 0.9.

7.3.2 Different Choices of Propagating Distrust                        • For combining trust and distrust values, we follow the
  Since we have shown that propagating distrust by using                 Equation 7, with β equals 1 − η. Test with different
the TrustRank mechanism may not be helpful, the next obvi-               values of η.
ous step is to investigate whether the choices of propagating          • We test with different numbers of distrusted seeds.
trust can also be applied for propagating distrust in order
to demote top ranked spam sites.                                       The results for these experiments are shown in Figure 4.
  Similar to the methods used for generating results in Ta-         There are three lines in the figure. Each represents the re-
ble 2, we applied the six combinations of different choices         sults by using 200, 400, 600 spam sites as distrusted seed
for the splitting step and accumulation steps to the propa-         respectively. From these results, we can tell that an increase
gation of distrust. In order to evaluate the performance, for       in the size of the distrusted seed set will result in an increase
each combination, we calculate the sum of the distrust value        in performance.
and TrustRank value for each site. Then this sum is used               Compared with the baseline results in Figure 1, more than
for ranking. Since the TrustRank value is unchanged for             80% of spam sites disappear from the top ten buckets. This
                                                                    verifies our hypothesis that using different trust propaga-
                                                                    tion methods together with distrust propagation can help
   Number of      No. of Spam sites         No. of Spam             to demote spam sites effectively.
     seeds        in top 10 buckets      sites in seed set             Actually, the results in Figure 4 are not our best results.
      200                        54                      4          During our experiments, we find that by using “Constant
      400                        55                      5
      600                        49                     12          Splitting” and “Maximum Parent” for trust propagation,
      800                        48                     13          “Logarithm Splitting” and “Maximum Share” for distrust
      1000                       45                     16          propagation with d, dD and η as 0.1, we can remove all the
                                                                    spam sites from the top ten buckets. We believe that there
Table 4: Results when using same mechanism as                       may be several other combinations that generate optimal
the propagation of trust in TrustRank to propagate                  results. However, due to resource constraints, we have not
distrust.                                                           enumerated every such combination.
                                                              Algorithm                              Constant                            Logarithm
                                                                                                      Splitting                           Splitting
                                                             d value                    d=0.1      d=0.3 d=0.7       d=0.9    d=0.1    d=0.3 d=0.7       d=0.9
                                                          Maximum Share                  77.71      77.73    77.74    77.74    77.19    77.72    77.73    77.73
                                                          Maximum Parent                 77.52      77.71    77.73    77.74    76.93    77.60    77.71    77.72

                                                  Table 5: Percentage of sites affected by combining different ideas to propagate trust.


7.5 Impact of Trust Propagation                                                                                      sites. In the future, we intend to study how the propagation
   Since the trust or distrust scores are propagated from lim-                                                       of trust or distrust can help raise high quality sites in the
ited number of seed pages, it is quite possible that only a                                                          ranking positions.
part of the whole web graph can be touched by this prop-                                                                We show that mechanisms such as “Logarithm Splitting”
agation. In other words, some pages will have zero values                                                            or “Maximum Share” for propagating trust and distrust can
after the algorithm is employed. We are not in a position                                                            do better than TrustRank in demoting top ranked spam
to make trust judgments with regard to these pages. It is                                                            sites. We intend to explore other choices that can help im-
highly desirable to have a well performing algorithm that                                                            prove the performance.
with a limited seed set enables us to make trust judgments                                                              In our paper, we combine trust and distrust scores only
about a large fraction of web pages.                                                                                 at the final step. It is possible that this combination can be
   Intuitively, different values for α in Equation 3 or d in                                                         done during the calculation of trust and distrust scores. We
“Constant Splitting” and “Logarithm Splitting” will de-                                                              aim to study the different choices that may be taken into
termine how far trust and distrust are propagated. In                                                                this combination.
TrustRank, smaller α means that more trust will be prop-                                                                Ranking algorithms such as PageRank are used by sev-
agated to children pages in each iteration; thus more pages                                                          eral popular search engines for ranking Web pages to given
may have nonzero value after 20 iterations. In order to show                                                         queries. The concept of authority and trustworthiness are
this, for the same experiment shown in Figure 3, we check                                                            not identical—PageRank gives an authority value for each
what percentage of sites have nonzero values according to                                                            page, while propagating trust from seed sets tells how trust-
different values of α. The results are shown in Table 6.                                                             worthy a page on the web is as a source of ranking informa-
   If more sites have nonzero values by using different                                                              tion. In this paper we have only explored the value of trust
choices, then we can claim that the trust scores are prop-                                                           propagation for spam demotion; ultimately the goal, how-
agated further by these choices. Since the results obtained                                                          ever, is to improve the quality of search results. We plan to
by using “Maximum Share” and “Maximum Parent” in Ta-                                                                 investigate combinations of trust and distrust with author-
ble 2 are better than TrustRank, we check the percentage of                                                          ity to measure the effect on search results ranking (quality
pages with nonzero values for these choices. The results are                                                         of results).
shown in Table 5.                                                                                                       All of our experiments are based on the search.ch data
   The results in Table 5 show larger numbers when com-                                                              set. This data set may have special characteristics different
pared to the results in Table 6. This demonstrates that                                                              from the whole web. We need to test the ideas presented
our choices can affect more pages as well as generate better                                                         here on a larger data set, such as the WebBase [16] data set,
performance in the demotion of top ranking spam sites.                                                               in the future.


8.                                     DISCUSSION                                                                    9. CONCLUSION
   In this paper, we investigate the possibility of using dif-                                                          In this paper, we show that propagating trust based on
ferent choices to propagate trust and distrust for ranking                                                           the number of outgoing links is not optimal in demoting top
Web pages or sites. We only focus on the demotion of spam                                                            ranked spam sites. Instead, we demonstrate that using dif-
                                                                                                                     ferent choices such as “Constant Splitting” or “Logarithm
                                                                                                                     Splitting” in the splitting step and “Maximum Share” or
 No. of spam sites in top 10 buckets




                                       14                                                         200
                                                                                                                     “Maximum Parent” in the accumulation step for propagat-
                                                                                                  400                ing trust can help to demote top ranked spam sites as well
                                       12                                                         600
                                                                                                                     as increase the range of trust propagation.
                                       10
                                       8
                                                                                                                                    Jump           Percentage of sites
                                       6                                                                                          Probability     with nonzero values
                                       4                                                                                              0.9                        59.28
                                                                                                                                      0.8                        66.72
                                       2                                                                                              0.7                        70.52
                                            0.1   0.2   0.3    0.4       0.5      0.6       0.7         0.8   0.9                     0.6                        72.79
                                                                     Value of eta                                                     0.5                        74.07
                                                                                                                                      0.4                        74.99
                                                                                                                                      0.3                        75.56
                                                                                                                                      0.2                        75.91
Figure 4: Number of top ranked spam sites when                                                                                        0.1                        76.13
ranking by the combination of trust score and dis-
trust score. Different η and different number of                                                                     Table 6: Percentage of sites affected when using dif-
seeds (200, 400, 600) are used.                                                                                      ferent jump probabilities.
  Additionally, by introducing the concept of propagating        [13] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen.
distrust among Web pages or sites, we show that the per-              Combating web spam with TrustRank. In Proceedings
formance of demoting top ranked spam sites can be further             of the 30th International Conference on Very Large
improved.                                                             Data Bases (VLDB), pages 271–279, Toronto,
                                                                      Canada, Sept. 2004.
Acknowledgments                                                  [14] T. Haveliwala. Topic-sensitive PageRank. In
This work was supported in part by the National Science               Proceedings of the Eleventh International World Wide
Foundation under award IIS-0328825. We are grateful to                Web Conference, pages 517–526, Honolulu, Hawaii,
Urban Müller for helpful discussions and for providing access        May 2002.
to the search.ch dataset.                                        [15] M. R. Henzinger, R. Motwani, and C. Silverstein.
                                                                      Challenges in web search engines. SIGIR Forum,
                                                                      36(2):11–22, Fall 2002.
10. REFERENCES                                                   [16] J. Hirai, S. Raghavan, H. Garcia-Molina, and
 [1] Pr0 - google’s pagerank 0, 2002.                                 A. Paepcke. WebBase: a repository of Web pages.
     http://pr.efactory.de/e-pr0.shtml.                               Computer Networks, 33(1–6):277–293, 2000.
 [2] A. Acharya, M. Cutts, J. Dean, P. Haahr,                    [17] G. Jeh and J. Widom. Scaling personalized web
     M. Henzinger, U. Hoelzle, S. Lawrence, K. Pfleger,               search. In Proceedings of the Twelfth International
     O. Sercinoglu, and S. Tong. Information retrieval                World Wide Web Conference, pages 271–279,
     based on historical data, Mar. 31 2005. US Patent                Budapest, Hungary, May 2003.
     Application number 20050071741.                             [18] S. Kamvar, T. Haveliwala, C. Manning, and G. Golub.
 [3] Z. Bar-Yossef, A. Z. Broder, R. Kumar, and                       Extrapolation methods for accelerating PageRank
     A. Tomkins. Sic transit gloria telae: Towards an                 computations. In Proceedings of the Twelfth
     understading of the web’s decay. In Proceedings of the           International World Wide Web Conference, 2003.
     Thirteenth International World Wide Web                     [19] P. Massa and C. Hayes. Page-rerank: using trusted
     Conference, New York, May 2004.                                  links to re-rank authority. In Proceedings of Web
 [4] A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher.             Intelligence Conference, France, Sept. 2005.
     SpamRank - fully automatic link spam detection. In          [20] G. Mishne, D. Carmel, and R. Lempel. Blocking blog
     Proceedings of the First International Workshop on               spam with language model disagreement. In
     Adversarial Information Retrieval on the Web                     Proceedings of the First International Workshop on
     (AIRWeb), 2005.                                                  Adversarial Information Retrieval on the Web
 [5] G. Collins. Latest search engine spam techniques,                (AIRWeb), 2005.
     Aug. 2004. Online at                                        [21] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly.
     http://www.sitepoint.com/article/search-engine-                  Detecting spam web pages through content analysis.
     spam-techniques.                                                 In Proceedings of the 15th International Conference on
 [6] I. Drost and T. Scheffer. Thwarting the nigritude                the World Wide Web, Edinburgh, Scotland, May 2006.
     ultramarine: Learning to identify link spam. In             [22] Open Directory Project, 2005. http://dmoz.org/.
     Proceedings of European Conference on Machine               [23] L. Page, S. Brin, R. Motwani, and T. Winograd. The
     Learning, pages 96–107, Oct. 2005.                               PageRank citation ranking: Bringing order to the
 [7] D. Fetterly, M. Manasse, and M. Najork. Spam, damn               web. Technical report, Stanford Digital Library
     spam, and statistics: Using statistical analysis to              Technologies Project, 1998.
     locate spam web pages. In Proceedings of WebDB,             [24] A. Perkins. White paper: The classification of search
     pages 1–6, June 2004.                                            engine spam, Sept. 2001. Online at
 [8] D. Fetterly, M. Manasse, and M. Najork. Detecting                http://www.silverdisc.co.uk/articles/spam-
     phrase-level duplication on the world wide web. In               classification/.
     Proceedings of the 28th Annual International ACM            [25] Räber Information Management GmbH. The Swiss
     SIGIR Conference on Research & Development in                    search engine, 2006. http://www.search.ch/.
     Information Retrieval, pages 170–177, Salvador,             [26] B. Wu and B. D. Davison. Identifying link farm spam
     Brazil, August 2005.                                             pages. In Proceedings of the 14th International World
 [9] E. Gray, J. Seigneur, Y. Chen, and C. Jensen. Trust              Wide Web Conference, pages 820–829, Chiba, Japan,
     propagation in small worlds. In Proceedings of the               May 2005.
     First International Conference on Trust Management,         [27] B. Wu, V. Goel, and B. D. Davison. Topical
     2003.                                                            TrustRank: Using topicality to combat web spam. In
[10] R. Guha. Open rating systems. Technical report,                  Proceedings of the 15th International World Wide
     Stanford University, 2003.                                       Web Conference, Edinburgh, Scotland, May 2006.
[11] R. Guha, R. Kumar, P. Raghavan, and A. Tomkins.             [28] C.-N. Ziegler and G. Lausen. Spreading activation
     Propagation of trust and distrust. In Proceedings of             models for trust propagation. In Proceedings of the
     the 13th International World Wide Web Conference,                IEEE International Conference on e-Technology,
     pages 403–412, New York City, May 2004.                          e-Commerce, and e-Service, Taipei, Taiwan, March
[12] Z. Gyöngyi and H. Garcia-Molina. Web spam                       2004. IEEE Computer Society Press.
     taxonomy. In First International Workshop on
     Adversarial Information Retrieval on the Web
     (AIRWeb), Chiba, Japan, 2005.