Detecting Content Spam on the Web through Text Diversity Analysis

© Anton Pavlov
M.V. Lomonosov Moscow State University, Faculty of Computational Mathematics and Cybernetics
pavvloff@yandex.ru

© Boris Dobrov
M.V. Lomonosov Moscow State University, Research Computer Center
dobroff@mail.cir.ru

Proceedings of the Spring Young Researcher's Colloquium On Database and Information Systems SYRCoDIS, Moscow, Russia, 2011

Abstract

Web spam is considered to be one of the greatest threats to modern search engines. Spammers use a wide range of content generation techniques, known as content spam, to fill search results with low quality pages. We argue that content spam must be tackled using a wide range of content quality features. In this paper we propose a set of content diversity features based on frequency rank distributions for terms and topics. We combine them with a wide range of other content features to produce a content spam classifier that outperforms existing results.

1 Introduction

Web spam, or spamdexing, is defined as "any deliberate action that is meant to trigger an unjustifiably favorable relevance or importance for some Web page, considering the page's true value" [15]. Studies show that at least 20 percent of hosts on the Web are spam [7]. Web spam is widely acknowledged as one of the most important challenges to web search engines [16].

There is a wide range of spamming techniques, usually aimed at different algorithms used in search engines. This article is dedicated to content spam detection. Content spamming, or term spamming, refers to "techniques that tailor the contents of text fields in order to make spam pages relevant for some queries" [15]. We argue that content spam can be detected using a combination of text quality features that cover multiple characteristics of natural texts. In this work we introduce several novel features based on frequency rank distributions for terms and topics that substantially improve content spam classification.

In Section 2 we provide the basic assumptions behind our research. In Section 3 we describe the content spam detection framework. Section 4 contains evaluation results. Section 5 is dedicated to future work and conclusions.

1.1 Related Work

Many spam detection techniques have been proposed in recent years during the Web Spam Challenge [20]. Some content features we used were proposed by Ntoulas et al. [17]. That work showed that text compressibility and some HTML-related characteristics distinguish content spam from normal pages. A large set of linguistic features was explored in a work by Piskorski et al. [19]. Latent Dirichlet Allocation [5] is known to perform well in text classification tasks. Biro et al. did a lot of research on modifying the LDA model to suit Web spam detection. They developed the multi-corpus LDA [3] and linked LDA [4] models. The former builds separate LDA models for spam and ham and uses topic weights as classification features. The latter incorporates link data into the LDA model for spam classification.

Web spam is also aimed at the web graph features used by search engines, so many researchers have focused on detecting link spam. Techniques like TrustRank [14] minimize the impact of spam pages on ranking. Much attention has been paid to fighting link farms – web graph structures designed to accumulate PageRank and affect other pages' rankings [21]. Finally, more and more researchers combine link and content data to improve classification results [1, 4]. In this work we did not use any link spam detection techniques, as we focused on content spam.

Fetterly et al. proposed using duplicate analysis to detect web spam [10]. They measured phrase-level duplication of content across the web and found that spam tends to have a greater number of popular shingles per document.
2 Understanding Content Spam

We believe that tackling Web spam is impossible without understanding how it works. Content spamming is aimed at text relevance algorithms, such as BM25 and tf.idf [15]. These algorithms are particularly vulnerable to content spam as there is a strong correlation between document relevance and the amount of query terms found in the text.

Content spam is often used in doorways – pages and sites designed specifically to attract and redirect traffic. Doorways are only efficient if they reach the top of search results. Spammers prefer to generate thousands of doorway pages, each optimized for a specific query, to maximize the amount of traffic collected.

This leads to several requirements that content spam must satisfy to be efficient:
• It must be generated in thousands of pages;
• Each page must maximize text relevance for some search query.

Thus spammers have few options for generating content for their doorways:
• They may generate content automatically;
• They may duplicate texts from other web sites;
• Or they may use a combination of both techniques.

Automatic text generation is a difficult task that does not have a satisfactory solution yet. Natural texts have multiple levels of consistency that are extremely hard to emulate all at once. In text generation tasks such as automatic document summarization, researchers distinguish multiple qualities of natural texts. Experiments show that even specialized text generation algorithms score low on most of these measures [9].

The levels of consistency include local coherence, style and authorship consistency, topical consistency, the logical structure of the document, etc. In this setting the uniqueness of a text is just another type of constraint that is inherent to natural texts. Our approach is based on controlling as many natural-text constraints as possible, making it harder for spammers to conceal low quality content.

There is a wide range of text generation techniques that produce locally coherent yet unreadable texts. Techniques like Markovian text generators are often used by web spammers to generate unique texts in great numbers. We were especially interested in designing a text quality analyzer that would detect such advanced types of web spam.

3 Content Spam Detection Framework

Our work was based on the assumption that spammers cannot emulate all aspects of natural texts. Our goal was to address as many domains of consistency as possible by using various features. We measured multiple aspects of text quality and used supervised learning to combine them into a content spam classifier. Despite the popular trend of combining link and content detection methods, we focused solely on content analysis.

The basic natural language characteristics such as readability and POS ratios are overviewed in Section 3.1. The novel part of our spam detection framework is a set of text diversity features. We designed a range of diversity features based on frequency rank distributions for different aspects of text diversity. The description and analysis of these features are provided in Section 3.2. Topical classification and topical diversity features based on the LDA statistical model are presented in Section 3.3.

All statistics on the described features were collected on the WEBSPAM-UK2007 dataset [22]. The spam prevalence histograms provided in this section were generated on the set of 3995 labeled hosts from the training part of the dataset.

3.1 Statistical Features

The benefit of using a wide range of linguistic features has been shown before by Piskorski et al. [19]. Such features are commonly used in stylometry and authorship identification. We used a POS tagger to tag every word in the dataset. We also substantially elaborated the linguistic features by implementing a set of style-related diversity features that are described in Section 3.2.

In order to extract maximum information from POS tagging we calculated the ratios of different parts of speech among words and the ratios of different grammatical categories:
• POS ratios:
  o Adjectives;
  o Nouns;
  o Pronouns;
  o Verbs;
  o Numerals;
  o Particles;
  o Conjunctions;
  o Articles;
• Grammatical categories:
  o Number;
  o Tense;
  o Aspect;
  o Mood.

Combinations of different parts of speech and categories resulted in 82 distinct grammatical forms. We calculated the ratio of each grammatical form:

\mathrm{Ratio}(form) = \frac{\#\,\text{form occurrences}}{\#\,\text{words}}.

We also measured the ratios of grammatical categories for specific parts of speech, e.g. the ratio of verbs in past tense compared to all verbs:

\mathrm{Ratio}_{\mathrm{verbs}}(\text{past tense}) = \frac{\#\,\text{verbs in past tense}}{\#\,\text{verbs}}.

As a result, we used a total of 145 POS-related statistical features.
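For illustration, the following Python sketch computes a few POS ratio features of this kind. The paper does not name its tagger, so NLTK's Penn Treebank tagger and the tag groupings below are assumptions rather than the authors' exact 82-form inventory.

from collections import Counter
import nltk  # requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

VERB_TAGS = ("VB", "VBD", "VBG", "VBN", "VBP", "VBZ")

def pos_ratio_features(text):
    # Tag every word and compute ratios of POS classes among all words,
    # plus one category ratio restricted to a single part of speech
    # (past-tense verbs among all verbs), as described in Section 3.1.
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    counts = Counter(tags)
    total = max(len(tags), 1)
    verbs = sum(counts[t] for t in VERB_TAGS)
    return {
        "adjective_ratio": sum(counts[t] for t in ("JJ", "JJR", "JJS")) / total,
        "noun_ratio": sum(counts[t] for t in ("NN", "NNS", "NNP", "NNPS")) / total,
        "pronoun_ratio": sum(counts[t] for t in ("PRP", "PRP$")) / total,
        "verb_ratio": verbs / total,
        "article_ratio": counts["DT"] / total,  # determiners as an approximation of articles
        "past_tense_verb_ratio": counts["VBD"] / max(verbs, 1),
    }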
Another domain we took features from is text readability research. Readability metrics were developed for military and educational purposes to measure how hard a text is to understand. Such features are helpful because automatically generated texts are usually unreadable. Some readability features have already been investigated by Ntoulas et al. [17]. We implemented a set of readability features:
• Average word length;
• Average sentence length;
• Average number of punctuation symbols per sentence;
• Ratio of words longer than 7 symbols;
• Ratio of words shorter than 3 symbols;
• Maximum sentence length;
• Minimum sentence length.

The set of 152 statistical features described above allows detecting simple anomalies in text, such as query dumping, but it is still inadequate to fight advanced types of spam.
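A minimal sketch of the readability features listed above; the tokenization rules are simplified assumptions, not the authors' implementation.

import re

def readability_features(text):
    # Split into rough sentences and words, then compute the length- and
    # punctuation-based features from Section 3.1.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    punct = re.findall(r'[,;:!?"()-]', text)
    sent_lens = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences] or [0]
    n_words = max(len(words), 1)
    return {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": sum(sent_lens) / len(sent_lens),
        "punct_per_sentence": len(punct) / max(len(sentences), 1),
        "long_word_ratio": sum(1 for w in words if len(w) > 7) / n_words,
        "short_word_ratio": sum(1 for w in words if len(w) < 3) / n_words,
        "max_sentence_length": max(sent_lens),
        "min_sentence_length": min(sent_lens),
    }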
Figure 1. Prevalence of spam relative to term uniformity (number of hosts and spam ratio by term uniformity).
Figure 2. Prevalence of spam relative to adjective ratio variance across sentences (number of hosts and spam ratio by adjective ratio variance).

3.2 Text Diversity Features

Many researchers have noticed that entropy and compressibility distinguish content spam from normal texts [17]. We argue that this trait stems from the auto-generated nature of content spam. Currently no text generation algorithm can reproduce the variety of natural language.

Some diversity-related features are easily faked by spammers. It is not uncommon for content spammers to use garbage text to decrease the compressibility of their texts in an attempt to foil spam detection algorithms. To overcome these limitations we propose measuring the variety of content in multiple aspects.

3.2.1 Character-Level Diversity

Compressibility is a well-known text variety feature. This characteristic has been used in both e-mail [6] and web spam detection [17]. Some content spamming techniques, such as keyword stuffing, produce texts with a large number of repetitions. We use the gzip and bz2 compression algorithms to measure the compressibility of a document.
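A minimal sketch of these compression-rate features, using Python's zlib (DEFLATE, the gzip family) and bz2 modules. A rate close to 1 means the text is hard to compress; heavily repetitive, keyword-stuffed text compresses to a much smaller fraction of its original size.

import bz2
import zlib

def compression_features(text):
    # Ratio of compressed size to original size for two compressors.
    raw = text.encode("utf-8")
    if not raw:
        return {"gzip_rate": 1.0, "bz2_rate": 1.0}
    return {
        "gzip_rate": len(zlib.compress(raw)) / len(raw),
        "bz2_rate": len(bz2.compress(raw)) / len(raw),
    }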
3.2.2 Term-Level Diversity

Compressibility is known to work well when repeated keywords are located near each other in the text. Spammers often dilute normal texts with keywords, thus making them harder to detect. Such subtle statistical violations can be detected by analyzing word frequency distributions.

Words in natural texts are known to obey power law frequency distributions. The most notable is the Zipf law [23], which states that the frequency of any term is inversely proportional to its rank. Given a word w with a frequency rank of rank(w), its frequency may be estimated using the following formula:

freq(w) \approx \frac{const}{rank(w)^{s}}.

The parameter s characterizes the variety of words in the given corpus of texts. We will refer to this value as the uniformity of terms. Greater uniformity leads to a greater frequency of the most probable words and lower frequencies of the other words. The easiest way to calculate uniformity for a document is to convert the Zipf law to logarithmic scale:

\log(freq(w)) \approx -s \cdot \log(rank(w)) + const.

Using this equation, uniformity can be estimated with linear least squares. Let n be the number of different words in the text, f_w = \log(freq(w)) and r_w = \log(rank(w)); then

s = -\frac{n \sum_{w} r_w f_w - \sum_{w} r_w \sum_{w} f_w}{n \sum_{w} r_w^{2} - \left( \sum_{w} r_w \right)^{2}}.    (*)

We estimated term uniformity to detect texts that contain multiple repeating keywords. In order to reduce the effect of stopwords we also calculated the uniformity for nouns only.

We also used a simpler approximation of term-level diversity by calculating the average number of terms that are repeated in neighboring sentences.

The prevalence of spam relative to term uniformity is shown in Figure 1. In this figure the horizontal axis corresponds to different levels of term uniformity. The white bars correspond to the number of hosts from the WEBSPAM-UK2007 training set with a given level of term uniformity, and the black line corresponds to the ratio of spam among those hosts. The figure shows that content spam tends to have greater uniformity, as spammers often repeat search keywords.
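The estimation of s by formula (*) can be sketched as follows: a least-squares fit of log-frequency against log-rank over a document's term counts, returning the negated slope so that larger values mean a more skewed (more uniform, in the paper's terminology) term distribution.

import math
from collections import Counter

def term_uniformity(tokens):
    # Rank terms by descending frequency and fit log(freq) vs. log(rank)
    # by ordinary least squares, as in formula (*).
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        return 0.0
    points = [(math.log(rank), math.log(freq))
              for rank, freq in enumerate(freqs, start=1)]
    n = len(points)
    sum_r = sum(r for r, _ in points)
    sum_f = sum(f for _, f in points)
    sum_rf = sum(r * f for r, f in points)
    sum_rr = sum(r * r for r, _ in points)
    slope = (n * sum_rf - sum_r * sum_f) / (n * sum_rr - sum_r ** 2)
    return -slope  # the slope is negative for Zipf-like data; s is its magnitude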
Figure 3a. Sample spam page with low topical uniformity (http://www.harrogate-toy-xmas-fair.co.uk/). The page consists of excerpts from different sources.
Figure 3b. Sample spam page with high topical uniformity (http://www.sherwoodguesthouseedinburgh.co.uk/). Notice the keywords at the top and side of the page and the highlighted keywords in the text.
Figure 4. Prevalence of spam relative to topical uniformity (number of hosts and spam ratio by topical uniformity).
Figure 5. Topical frequency distributions for different types of spam (topic frequency by topic frequency rank for the spam host www.harrogate-toy-xmas-fair.co.uk, the ham host www.silverlight.co.uk, and the spam host www.sherwoodguesthouseedinburgh.co.uk).

3.2.3 Sentence Structure Diversity

Most content spam generation techniques produce new unique texts from a set of natural samples. Spammers may use a Markovian text generator trained on a set of natural documents, or they may simply take sentences from different texts to form a single page's content. These techniques often yield locally coherent texts that are hard to detect. To fight these types of spam we developed a set of features that measure the diversity of styles used in a text.

We elaborated the POS features described in Section 3.1 by adding a wide range of linguistic diversity features to detect style anomalies in texts. For each of the 145 POS ratio features we calculated its variance across the sentences of the text. The distribution of variances of the adjective ratio is shown in Figure 2; similar distributions hold for other parts of speech and different grammatical categories. The graph confirms our hypothesis that content spam tends to mix styles from different texts, resulting in higher variances.
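A sketch of the per-sentence variance computation for one such feature, the adjective ratio; NLTK is assumed as the tagger, and the same loop applies to any of the 145 POS ratios.

import statistics
import nltk  # requires punkt and averaged_perceptron_tagger data

def adjective_ratio_variance(text):
    # Compute the adjective ratio separately for every sentence and return
    # the variance of these per-sentence ratios (Section 3.2.3).
    ratios = []
    for sentence in nltk.sent_tokenize(text):
        tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
        if tags:
            ratios.append(sum(1 for t in tags if t.startswith("JJ")) / len(tags))
    return statistics.pvariance(ratios) if len(ratios) > 1 else 0.0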
3.3 Topical Analysis

Web spam has a tendency to belong to several popular topics, like insurance or pornography. We used topical features for two purposes. Firstly, we used Latent Dirichlet Allocation (LDA) to measure the weights of different topics in texts and used these weights as classification features. Secondly, we analyzed the frequency rank distributions of these weights in order to detect topical structure anomalies.

3.3.1 LDA

We decided to implement a set of topical classification features using Latent Dirichlet Allocation [5]. LDA is a fully generative probabilistic model for texts. LDA assumes that each document is generated by a mixture of topics. The weights of these topics can be used for topical classification. Most importantly, the LDA weights were used to measure the topical diversity of texts. LDA-based topical diversity features are described in Section 3.3.2.

LDA has well-established parameter estimation and inference procedures based on Markov chain Monte Carlo methods [2]. We used the GibbsLDA++ library [18], which implements the Gibbs sampling algorithm for inference and parameter estimation. We trained the LDA model on 20K random documents from the WEBSPAM-UK2007 dataset, using 100 topics and the hyper-parameters α = 0.5, β = 0.01.

We could have used tf.idf for topical classification, but LDA also served as a dimensionality reduction algorithm. As a result we mapped every document into a 100-dimensional topic space, instead of using a high-dimensional term vector space. The weights of the different topics served as features in classification.
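As an illustration, a rough equivalent of this setup using the gensim library; this is not the GibbsLDA++ tool used in the paper, and gensim's default inference is variational rather than Gibbs sampling, so it only approximates the authors' configuration.

from gensim import corpora, models

def train_lda(tokenized_docs, num_topics=100, alpha=0.5, eta=0.01):
    # Train an LDA model on tokenized documents with the paper's reported
    # settings (100 topics, alpha = 0.5, beta = 0.01).
    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics,
                          alpha=alpha, eta=eta, passes=10)
    return lda, dictionary

def topic_weights(lda, dictionary, tokens, num_topics=100):
    # Dense topic-weight vector used both as classification features
    # and as input to the topical diversity features of Section 3.3.2.
    bow = dictionary.doc2bow(tokens)
    weights = [0.0] * num_topics
    for topic_id, weight in lda.get_document_topics(bow, minimum_probability=0.0):
        weights[topic_id] = weight
    return weights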
                                                                                                                  classification formula.
3.3.2 Topical Diversity
    The analysis of LDA topic weights showed that                                                                 3.5 Complexity estimation
these weights also have a power law distribution. Figure                                                              To prove that the proposed algorithm can be used in
5 shows the weights distribution for several samples of                                                           web-scale spam detection tasks we also estimated the
spam and non-spam hosts. Topical distributions are                                                                complexity of the proposed algorithm during the
correlated with term frequency distributions, but have                                                            classification phase. The algorithm can be loosely split
an advantage over them. LDA accounts for correlated                                                               in 3 parts:
terms thus a single LDA topic usually covers a whole                                                                    Statistical features calculation;
set of terms that often co-occur. This ensures that                                                                     Topical diversity estimation based on LDA;
synonyms and similar terms are counted together, and                                                                    Machine learning;
leaves spammers less chances to affect the feature.                                                                   The first phase includes POS tagging and
    For each document we estimated the uniformity of                                                              compressibility analysis. We used simple POS taggers
frequency rank distributions of the LDA weights using                                                             that analyze single words and do not take previous
the formula (*) using topic frequencies instead on word                                                           words in account. The complexity of the POS tagging
frequencies. The prevalence of spam for different levels                                                          process in on the order of document’s length O(|d|).
of topical uniformity is shown in Figure 4. The                                                                       The first phase also includes term-level diversity
probability of spam is greater for hosts with both high                                                           calculation that implies words being sorted by their
and low uniformity. These two zones account for                                                                   frequencies. So the complexity of the diversity
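A direct sketch of this score:

def chi_squared_topic_score(weights):
    # chi2 = N * sum_t (1/N - w_t)^2, where N is the number of topics.
    # Larger values mean the document is less likely to have a uniform topic mix.
    n = len(weights)
    if n == 0:
        return 0.0
    expected = 1.0 / n
    return n * sum((expected - w) ** 2 for w in weights)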
3.4 Machine Learning

Using LDA as a dimensionality reduction algorithm allowed us to use algorithms designed for dense data, without implementing complex ensembles of classifiers.

We used logistic regression with L2 regularization and a fixed regularization parameter value of 0.25. It produces a relatively simple linear classifier whose regression coefficients can be interpreted as the contribution of each feature to the classification task. Some features, such as topical uniformity, show non-linear behavior that cannot be accounted for by a linear classification formula.
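A sketch of this classification step with scikit-learn; note that scikit-learn parameterizes L2 regularization inversely (C), so equating it with the paper's fixed value of 0.25 is an assumption rather than the authors' exact setting.

from sklearn.linear_model import LogisticRegression

def build_classifier():
    # L2-regularized logistic regression over the dense feature vector
    # (statistical, diversity and LDA topic-weight features).
    return LogisticRegression(penalty="l2", C=0.25, max_iter=1000)

# Usage: X rows hold the feature vectors, y holds 0/1 spam labels.
# clf = build_classifier().fit(X_train, y_train)
# spam_scores = clf.predict_proba(X_test)[:, 1]  # e.g. for AUC evaluation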
                                     Table 1. Feature strength analysis
                                Feature                                      F-measure          Feature type
                          Topical uniformity                                  91.23%             Diversity
                        Gzip compression rate                                 89.70%             Diversity
                      χ2 score for LDA weights                                87.03%             Diversity
                        bz2 compression rate                                  85.04%             Diversity
                           Term uniformity                                    81.28%             Diversity
       Average number of words repeated in neighbor sentences                 79.60%             Diversity
                       Verbs in past tense ratio                              74.49%             Statistical
     Average number of expressive punctuation marks per sentence              73.54%             Statistical
                     Verbs in past tense variance                             73.34%             Diversity
                        Modal verbs variance                                  72.88%             Diversity
               Fraction of sentences with several verbs                       71.27%             Statistical
                       Personal pronouns ratio                                71.13%             Statistical
                          Proper nouns ratio                                  71.06%             Statistical
                     Possessive endings variance                              70.66%             Diversity
                    Words with one syllable ratio                             70.63%             Statistical
                          Modal verbs ratio                                   70.59%             Statistical
                   Words with two syllables ratio                             70.56%             Statistical
                      Cardinal numbers variance                               70.55%             Diversity
                       Cardinal numbers ratio                                 70.06%             Statistical
                          Determiners ratio                                   69.82%             Statistical

3.5 Complexity Estimation

To show that the proposed algorithm can be used in web-scale spam detection tasks, we also estimated its complexity during the classification phase. The algorithm can be loosely split into three parts:
• statistical feature calculation;
• topical diversity estimation based on LDA;
• machine learning.

The first phase includes POS tagging and compressibility analysis. We used simple POS taggers that analyze single words and do not take previous words into account. The complexity of the POS tagging process is on the order of the document's length, O(|d|). The first phase also includes the term-level diversity calculation, which requires sorting words by their frequencies, so the complexity of the diversity calculation is on the order of O(|d| log |d|).

The second part of the algorithm starts with LDA inference. Gibbs sampling is used for inference, and the complexity of each iteration is proportional to the length of the document and the number of topics used [18]. Instead of running Gibbs sampling until convergence, we used a fixed number of iterations that suited our purposes well. So the complexity of the Gibbs sampling phase was O(|d|). The calculation of topical diversity after the topic weights have been estimated depends only on the number of topics, and its complexity can be assumed constant.

Finally, in the machine learning phase we used a constant number of features in a linear classification formula, so its complexity is also constant. Overall, the complexity of the proposed classification algorithm is O(|d| log |d|), where |d| is the length of the classified document.

4 Experiments

The evaluation of the proposed framework consisted of two experiments. First we tested the ability of our approach to detect synthetic, automatically generated texts. The second experiment was dedicated to measuring the benefit of the proposed features. Finally, we tested the framework in the Web Spam Challenge [20] settings.

4.1 Synthetic Text Experiment

First we tested the capability of the described features to detect automatically generated low quality texts. We created a set of synthetic texts using a Markovian text generator. The generator was trained on a collection of 20K random documents from the WEBSPAM-UK2007 dataset. Here is a sample of such synthetic text generated from this article:

    "Tf.idf and other term-weighting approaches are often used by web spammers to generate thousands of doorway pages, each optimized for a specific query, to maximize amount of text, and ratio of verbs in past tense compared to all verbs: We used POS tagger to tag every word in the dataset."

Such texts consist of locally coherent pieces collected from other documents. We used 10K synthetic documents and 10K random documents from the WEBSPAM-UK2007 dataset as a training set. The test set for the experiment was created in a similar fashion. We used Markov chains of order 2 (MC2) and 3 (MC3) to measure the effect of this parameter on classification.

In order to measure the effect of the proposed features we made two runs of the experiment. First we used only the statistical features and LDA weights as a baseline (SF+LDA). During the second run we used all available features, including the diversity features (All).

Table 2 contains the results of the experiment. The high F-measure suggests that the described features are adequate for detecting such advanced types of content spam. An increase in the Markov chain order causes the generator to repeat larger pieces of the original documents. This reduces the detection rate, but increases the amount of non-unique content in such texts. The results also show that the proposed diversity features substantially improve the classifier. In fact, they cut the number of false positives and false negatives in half.

Table 2. Precision, Recall, and F-measure for the synthetic text detection experiment

                 Precision   Recall    F-measure
MC2, SF + LDA    96.19%      96.11%    96.15%
MC3, SF + LDA    94.08%      92.29%    93.18%
MC2, All         98.37%      97.93%    98.14%
MC3, All         97.72%      97.09%    97.40%
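For illustration, a minimal order-n Markovian text generator of the kind described in this experiment; the authors' exact implementation is not specified beyond the chain order, so this is only a sketch.

import random
from collections import defaultdict

def train_markov(tokens, order=2):
    # Map each n-gram state to the list of tokens observed to follow it.
    model = defaultdict(list)
    for i in range(len(tokens) - order):
        state = tuple(tokens[i:i + order])
        model[state].append(tokens[i + order])
    return model

def generate(model, order=2, length=200, seed=None):
    # Walk the chain, restarting from a random state at dead ends.
    # Assumes the training text is longer than the chain order.
    rng = random.Random(seed)
    state = rng.choice(list(model.keys()))
    out = list(state)
    while len(out) < length:
        followers = model.get(tuple(out[-order:]))
        if not followers:
            out.extend(rng.choice(list(model.keys())))
            continue
        out.append(rng.choice(followers))
    return " ".join(out)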
4.2 Feature Analysis

The purpose of the second experiment was to estimate the power of each of the 334 features. The settings of this experiment were similar to the synthetic text detection experiment. We used 20K documents from the WEBSPAM-UK2007 dataset as a non-spam sample and generated 20K documents using a Markov chain text generator of order 2. These sets were then split evenly into training and testing datasets.

For each feature we trained a separate classifier on that single feature. The classification F-measure of the resulting classifier can be viewed as a measure of usefulness of the corresponding feature. Table 1 contains the 20 most useful features for the synthetic text classification task.

The results of the experiment show that diversity features are paramount for detecting Markov chain generated texts. The proposed topical diversity features score best on this metric, along with text compressibility. Other diversity features can also be seen among the top 20.
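A sketch of this per-feature protocol; the paper does not name the classifier used for the single-feature models, so logistic regression is assumed here.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def single_feature_strength(X_train, y_train, X_test, y_test, feature_index):
    # Train a classifier on one feature column only and report its F-measure
    # on the held-out half (X_* are numpy arrays, y_* are 0/1 labels).
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train[:, [feature_index]], y_train)
    return f1_score(y_test, clf.predict(X_test[:, [feature_index]]))

# Ranking all columns by this score yields a Table-1-style listing:
# scores = sorted(((single_feature_strength(Xtr, ytr, Xte, yte, i), i)
#                  for i in range(Xtr.shape[1])), reverse=True)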
4.3 WEBSPAM-UK2007 Experiment

In this experiment we followed the evaluation protocol of the Web Spam Challenge [20]. Using this evaluation procedure we could compare our results with other studies. The Web Spam Challenge 2008 was held on the WEBSPAM-UK2007 dataset [22]. The training and testing labels are also defined in the dataset. The official quality measure for the challenge was the Area Under the ROC Curve (AUC ROC). We also calculated the optimal F-measure for the classification task.

We compared against the best results on this dataset. The winners of the 2008 Web Spam Challenge, Geng et al. [12], used pre-computed features and advanced bagging strategies to reach an AUC of 0.85. Biro et al. [4] used the linked LDA model to combine link and content features, yielding an AUC score of 0.854. Dai et al. [8] used temporal features and achieved a classification F-measure of 0.521.

We combined the features into three groups:
• SF – statistical features (Section 3.1);
• DF – various text diversity features (Sections 3.2 and 3.3.2);
• LDA – the Latent Dirichlet Allocation topic weights (Section 3.3.1).

The results of classification using various groups of features are provided in Table 3. Using logistic regression, the best result of 0.871 AUC is achieved when combining all features. Our approach substantially improves over the nearest result of 0.854 AUC. The results show that topical classification features (LDA) are still crucial to web spam detection, but statistical features (SF) and diversity features (DF) improve the results substantially.

Table 3. Results for the WEBSPAM-UK2007 experiment

Features            AUC      F1
Geng et al.         0.85     --
Biro et al.         0.854    --
SF                  0.746    0.284
DF                  0.744    0.323
LDA                 0.845    0.442
SF+DF               0.777    0.348
SF+LDA              0.867    0.433
DF+LDA              0.864    0.448
All (SF+DF+LDA)     0.871    0.458

(SF – statistical features; DF – diversity features; LDA – Latent Dirichlet Allocation topic weights.)

5 Conclusion and Future Work

The results of our research show that advanced content features are useful for content spam detection. We analyzed different aspects of natural texts and produced a set of features to cover as many aspects as possible. The resulting spam classifier performed well on both synthetic and real-life tasks.

The proposed approach is based solely on content analysis and does not take link data into account. Combining the proposed method with existing link spam detection techniques is likely to improve results. Another possible extension is to apply the diversity measures and rank distributions to link data in order to detect unnatural link structures.

Web spam is primarily an economic phenomenon, and the amount of spam depends on the efficiency and costs of different spam generation techniques. We hope that the multiple diversity features described in this work can substantially decrease the efficiency of automatically generated content spam. There are many properties of natural texts that are not covered by this article. We plan to continue research on various aspects of natural texts that are hard to reproduce.

References

[1] J. Abernethy, O. Chapelle, and C. Castillo. WITCH: A New Approach to Web Spam Detection. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.
[2] C. Andrieu, N. de Freitas, A. Doucet, M. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5–43, 2003.
[3] I. Biro, J. Szabo, A. A. Benczur. Latent Dirichlet allocation in web spam filtering. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Beijing, China, April 2008.
[4] I. Biro, D. Siklosi, J. Szabo, A. A. Benczur. Linked latent Dirichlet allocation in web spam filtering. In Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Madrid, Spain, April 2009.
[5] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[6] A. Bratko, G. V. Cormack, B. Filipič, T. R. Lynam, and B. Zupan. Spam filtering using statistical data compression models. Journal of Machine Learning Research, 7(Dec):2673–2698, 2006.
[7] C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini, S. Vigna. A reference collection for web spam. ACM SIGIR Forum, 40(2):11–24, December 2006.
[8] N. Dai, B. D. Davison, X. Qi. Looking into the past to better classify web spam. In Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2009.
[9] H. Dang. Overview of DUC 2006. In Proceedings of the Document Understanding Conference, 2006.
[10] D. Fetterly, M. Manasse, and M. Najork. Detecting
     phrase-level duplication on the world wide web. In
     Proceedings of the 28th ACM International
     Conference on Research and Development in
     Information Retrieval (SIGIR), Salvador, Brazil,
     2005.
[11] D. Fetterly, M. Manasse, and M. Najork. Spam,
     damn spam, and statistics – Using statistical
     analysis to locate spam web pages. In Proceedings
     of the 7th International Workshop on the Web and
     Databases (WebDB), pages 1–6, Paris, France,
     2004.
[12] G. Geng, X. Jin, C.-H. Wang. CASIA at Web Spam
     Challenge 2008 Track III. In Proceedings of the 4th
     International Workshop on Adversarial Information
     Retrieval on the Web (AIRWeb), 2008.
[13] A. Gulin, P. Karpovich. Greedy Function
     Optimization in Learning to Rank, 2009, Available
     at:
     http://romip.ru/russir2009/slides/yandex/lecture.pdf
[14] Z. Gyongyi, H. Garcia-Molina and J. Pedersen.
     Combating Web Spam with TrustRank. In 30th
     International Conference on Very Large Data
     Bases, Aug. 2004.
[15] Z. Gyongyi and H. Garcia-Molina. Web Spam
     Taxonomy. In 1st International Workshop on
     Adversarial Information Retrieval on the Web, May
     2005.
[16] M. Henzinger, R. Motwani, C. Silverstein.
     Challenges in Web Search Engines. SIGIR Forum
     36(2), 2002.
[17] A. Ntoulas, M. Najork, M. Manasse, D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International Conference on World Wide Web (WWW), Edinburgh, Scotland, May 2006.
[18] X.-H. Phan, C.-T. Nguyen. GibbsLDA++: A C/C++ Implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling for Parameter Estimation and Inference. http://gibbslda.sourceforge.net/, 2008.
[19] J. Piskorski, M. Sydow, D. Weiss. Exploring linguistic features for web spam detection: a preliminary study. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Beijing, China, April 2008.
[20] Web Spam Challenge. http://webspam.lip6.fr/wiki/pmwiki.php, 2008.
[21] B. Wu, B. D. Davison. Identifying link farm spam pages. In Special Interest Tracks and Posters of the 14th International Conference on World Wide Web (WWW), 2005.
[22] Yahoo! Research: "Web Spam Collections". http://barcelona.research.yahoo.net/webspam/datasets/. Crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.dsi.unimi.it/. URLs retrieved May 2007.
[23] G. Zipf. Selective Studies and the Principle of Relative Frequency in Language. Cambridge, Mass., 1932.