Detecting Content Spam on the Web through Text Diversity Analysis

© Anton Pavlov
M.V. Lomonosov Moscow State University,
Faculty of Computational Mathematics and Cybernetics
pavvloff@yandex.ru

© Boris Dobrov
M.V. Lomonosov Moscow State University,
Research Computer Center
dobroff@mail.cir.ru

Proceedings of the Spring Young Researcher's Colloquium on Database and Information Systems (SYRCoDIS), Moscow, Russia, 2011.

Abstract

Web spam is considered to be one of the greatest threats to modern search engines. Spammers use a wide range of content generation techniques known as content spam to fill search results with low quality pages. We argue that content spam must be tackled using a wide range of content quality features. In this paper we propose a set of content diversity features based on frequency rank distributions for terms and topics. We combine them with a wide range of other content features to produce a content spam classifier that outperforms existing results.

1 Introduction

Web spam, or spamdexing, is defined as "any deliberate action that is meant to trigger an unjustifiably favorable relevance or importance for some Web page, considering the page's true value" [15]. Studies show that at least 20 percent of hosts on the Web are spam [7]. Web spam is widely acknowledged as one of the most important challenges to web search engines [16].

There is a wide range of spamming techniques, usually aimed at different algorithms used in search engines. This article is dedicated to content spam detection algorithms. Content spamming, or term spamming, refers to "techniques that tailor the contents of text fields in order to make spam pages relevant for some queries" [15]. We argue that content spam can be detected using a combination of text quality features that cover multiple characteristics of natural texts. In this work we introduce several novel features based on frequency rank distributions for terms and topics that substantially improve content spam classification.

In Section 2 we provide the basic assumptions behind our research. In Section 3 we describe the content spam detection framework. Section 4 contains evaluation results. Section 5 is dedicated to future work and conclusions.

1.1 Related Work

Many spam detection techniques have been proposed in recent years, in particular in the course of the Web Spam Challenge [20]. Some content features we used were proposed by Ntoulas et al. [17]. This work showed that compressibility of text and some HTML-related characteristics distinguish content spam from normal pages. A large set of linguistic features was explored in a work by Piskorski et al. [19]. Latent Dirichlet Allocation [5] is known to perform well in text classification tasks. Biro et al. did a lot of research on modifying the LDA model to suit Web spam detection. They developed the multi-corpus LDA [3] and linked LDA [4] models. The former builds separate LDA models for spam and ham and uses topic weights as classification features. The latter incorporates link data into the LDA model for spam classification.

Web spam is also aimed at the web graph features used by search engines, so many researchers have focused on detecting link spam. Techniques like TrustRank [14] minimize the impact of spam pages on ranking. Much attention has been focused on fighting link farms – web graph structures designed to accumulate PageRank and affect other pages' rankings [21]. Finally, more and more researchers combine link and content data to improve classification results [1, 4]. In this work we did not use any link spam detection techniques, as we focused on content spam.

Fetterly et al. proposed using duplicate analysis to detect web spam [10]. They measured phrase-level duplication of content across the web and found that spam tends to have a greater number of popular shingles per document.

2 Understanding Content Spam

We believe that tackling Web spam is impossible without understanding how it works. Content spamming is aimed at text relevance algorithms, such as BM25 and tf.idf [15]. These algorithms are particularly vulnerable to content spam, as there is a strong correlation between document relevance and the number of query terms found in the text.

Content spam is often used in doorways – pages and sites designed specifically to attract and redirect traffic. Doorways are only efficient if they reach the top of search results. Spammers prefer to generate thousands of doorway pages, each optimized for a specific query, to maximize the amount of traffic collected.

This leads to several requirements that content spam must satisfy to be efficient:
- It must be generated in thousands of pages;
- Each page must maximize text relevance for some search query.

Thus spammers have very few options for generating content for their doorways:
- They may generate content automatically;
- They may duplicate texts from other web sites;
- Or they may use a combination of both techniques.

Automatic text generation is a difficult task that does not have a satisfactory solution yet. Natural texts have multiple levels of consistency that are extremely hard to emulate all at once. In text generation tasks such as automatic document summarization researchers distinguish multiple qualities of natural texts. Experiments show that even specialized text generation algorithms score low in most of these measures [9].

The levels of consistency include local coherence, style and authorship consistency, topical consistency, logical structure of the document, etc. In this setting the uniqueness of a text is just another type of constraint that is inherent to natural texts. Our approach is based on controlling as many natural text constraints as possible, making it harder for spammers to conceal low quality content.

There is a wide range of text generation techniques that produce locally coherent yet unreadable texts. Techniques like Markovian text generators are often used by web spammers to generate unique texts in great numbers. We were especially interested in designing a text quality analyzer that would detect such advanced types of web spam.
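To make this threat model concrete, the sketch below shows a word-level Markovian text generator of the kind mentioned above. It is a minimal illustration written for this paper, not a spammer's actual tool; the order-2 state and whitespace tokenization are our own simplifying assumptions. Trained on a sample of crawled pages, such a generator produces thousands of unique, locally plausible documents, which is exactly the kind of content the features in Section 3 are designed to expose.

```python
import random
from collections import defaultdict

def train_markov(corpus_texts, order=2):
    """Collect (order)-gram -> next-word statistics from sample documents."""
    model = defaultdict(list)
    for text in corpus_texts:
        words = text.split()
        for i in range(len(words) - order):
            state = tuple(words[i:i + order])
            model[state].append(words[i + order])
    return model

def generate(model, order=2, length=200, seed=None):
    """Produce a locally coherent but globally meaningless text."""
    rng = random.Random(seed)
    state = rng.choice(list(model.keys()))
    out = list(state)
    for _ in range(length):
        candidates = model.get(tuple(out[-order:]))
        if not candidates:                      # dead end: jump to a new state
            state = rng.choice(list(model.keys()))
            out.extend(state)
            continue
        out.append(rng.choice(candidates))
    return " ".join(out)
```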
3 Content Spam Detection Framework

Our work was based on the assumption that spammers cannot emulate all aspects of natural texts. Our goal was to address as many domains of consistency as possible by using various features. We measured multiple aspects of text quality and used supervised learning to combine them into a content spam classifier. Despite a popular trend of combining link and content detection methods, we focused solely on content analysis.

The basic natural language characteristics such as readability and POS ratios are overviewed in Section 3.1. The novel part of our spam detection framework is a set of text diversity features. We designed a range of diversity features based on frequency rank distributions for different aspects of text diversity. The description and analysis of these features are provided in Section 3.2. Topical classification and topical diversity features based on the LDA statistical model are presented in Section 3.3.

All statistics on the described features were collected on the WEBSPAM-UK2007 dataset [22]. The spam prevalence histograms provided in this section were generated on the set of 3995 labeled hosts from the training part of the dataset.

3.1 Statistical Features

The benefit of using a wide range of linguistic features has been shown before by Piskorski et al. [19]. These features are commonly used in stylometry and authorship identification. We used a POS tagger to tag every word in the dataset. We also substantially elaborated the linguistic features by implementing a set of style-related diversity features that are described in Section 3.2.

In order to extract maximum information from POS tagging we calculated ratios of different parts of speech and ratios of different grammatical categories:
- POS ratios:
  - Adjectives;
  - Nouns;
  - Pronouns;
  - Verbs;
  - Numerals;
  - Particles;
  - Conjunctions;
  - Articles;
- Grammatical categories:
  - Number;
  - Tense;
  - Aspect;
  - Mood.

Combinations of different parts of speech and categories resulted in 82 distinct grammatical forms. We calculated the ratio of each grammatical form:

  Ratio(form) = #form_occurrences / #words.

We also measured ratios of grammatical categories for specific parts of speech, e.g. the ratio of verbs in past tense compared to all verbs:

  Ratio_verbs(past_tense) = #verbs_in_past_tense / #verbs.

As a result, we used a total of 145 POS-related statistical features.
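For illustration, a minimal sketch of how these ratios can be computed from a POS-tagged document is given below. The tagged-document format and the tag label names are our own assumptions for the example; the paper does not prescribe a particular tagger or tag set.

```python
from collections import Counter

def pos_ratio_features(tagged_words):
    """tagged_words: list of (word, form) pairs, where `form` is a grammatical
    form label such as 'VERB_PAST' or 'NOUN_PLURAL' (assumed labels).
    Returns Ratio(form) = #form_occurrences / #words for every observed form."""
    n_words = len(tagged_words)
    counts = Counter(form for _, form in tagged_words)
    return {form: count / n_words for form, count in counts.items()}

def category_ratio(tagged_words, pos, category):
    """Ratio of a grammatical category within one part of speech,
    e.g. category_ratio(doc, 'VERB', 'PAST') ~ #verbs_in_past_tense / #verbs."""
    forms = [form for _, form in tagged_words if form.startswith(pos)]
    if not forms:
        return 0.0
    return sum(category in form for form in forms) / len(forms)

# Example (hypothetical tags):
# doc = [("walked", "VERB_PAST"), ("dogs", "NOUN_PLURAL"), ("slowly", "ADVERB")]
# pos_ratio_features(doc); category_ratio(doc, "VERB", "PAST")
```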
Another domain we took features from was text readability research. Readability metrics were developed for military and educational purposes to measure how hard a text is to understand. Such features are helpful because automatically generated texts are usually unreadable. Some readability features have already been investigated by Ntoulas et al. [17]. We implemented a set of readability features:
- Average word length;
- Average sentence length;
- Average number of punctuation symbols per sentence;
- Ratio of words longer than 7 symbols;
- Ratio of words shorter than 3 symbols;
- Maximum sentence length;
- Minimum sentence length.

The set of 152 statistical features described above allows detecting simple anomalies in text, such as query dumping, but it is still inadequate to fight advanced types of spam.

3.2 Text Diversity Features

Many researchers have noticed that entropy and compressibility distinguish content spam from normal texts [17]. We argue that this trait stems from the auto-generated nature of content spam. Currently no text generation algorithm can reproduce the variety of natural language.

Some diversity-related features are easily faked by spammers. It is not uncommon for content spammers to use garbage text to decrease the compressibility of texts in an attempt to foil spam detection algorithms. To overcome these limitations we propose measuring the variety of content in multiple aspects.

3.2.1 Character-Level Diversity

Compressibility is a well-known text variety feature. This characteristic has been used in both e-mail [6] and web spam detection [17]. Some content spamming techniques such as keyword stuffing produce texts with a large number of repetitions. We use the gzip and bz2 compression algorithms to measure the compressibility of a document.
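A compression-ratio feature of this kind can be computed directly with the standard zlib and bz2 modules; the following is a minimal sketch, not the implementation used in this work (raw zlib output differs from gzip by a few header bytes, which does not matter for a ratio feature).

```python
import bz2
import zlib

def compressibility(text: str) -> dict:
    """Compressed-size / original-size ratios; lower values mean more repetition."""
    raw = text.encode("utf-8")
    if not raw:
        return {"gzip_ratio": 1.0, "bz2_ratio": 1.0}
    return {
        "gzip_ratio": len(zlib.compress(raw)) / len(raw),
        "bz2_ratio": len(bz2.compress(raw)) / len(raw),
    }
```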
3.2.2 Term-Level Diversity

Compressibility is known to work well when repeated keywords are located nearby in the text. Spammers often dilute normal texts with keywords, thus making them harder to detect. Such subtle statistical violations can be detected by analyzing word frequency distributions.

Words in natural texts are known to obey power law frequency distributions. The most notable is Zipf's law [23], which states that the frequency of any term is inversely proportional to its rank. Given a word w with a frequency rank of rank(w), its frequency may be estimated using the following formula:

  freq(w) = const / rank(w)^s.

The parameter s characterizes the variety of words in the given corpus of texts. We will refer to this value as the uniformity of terms. Greater uniformity leads to a greater frequency of the most probable words and lower frequencies of the other words. The easiest way to calculate uniformity for a document is to convert Zipf's law to logarithmic scale:

  log(freq(w)) = -s · log(rank(w)) + const.

Using this equation, uniformity can be estimated with linear least squares. Let n be the number of different words in the text, f_w = log(freq(w)) and r_w = log(rank(w)); then

  s = - (n · Σ_w r_w f_w  -  Σ_w r_w · Σ_w f_w) / (n · Σ_w (r_w)²  -  (Σ_w r_w)²).   (*)

We estimated term uniformity to detect texts that contain multiple repeating keywords. In order to reduce the effect of stopwords we also calculated the uniformity for nouns. We also used a simpler approximation of term-level diversity by calculating the average number of terms that are repeated in neighboring sentences.

Figure 1. Prevalence of spam relative to term uniformity (bars: number of hosts; line: spam ratio; horizontal axis: term uniformity).

The prevalence of spam relative to term uniformity is shown in Figure 1. In this figure the horizontal axis corresponds to different levels of term uniformity. The white bars correspond to the number of hosts from the WEBSPAM-UK2007 training set with a given level of term uniformity, and the black line corresponds to the ratio of spam among those hosts. The figure shows that content spam tends to have greater uniformity, as spammers often repeat search keywords.
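Formula (*) is an ordinary least-squares fit in log-log coordinates. For illustration, a self-contained sketch of the term uniformity computation is given below; it is our own code, with simple whitespace-style tokenization assumed rather than the preprocessing used in the paper.

```python
import math
from collections import Counter

def term_uniformity(words):
    """Estimate the Zipf exponent s from log(freq) vs. log(rank) by least squares.
    Larger s means the frequency mass is concentrated on few terms."""
    freqs = sorted(Counter(words).values(), reverse=True)
    n = len(freqs)
    if n < 2:
        return 0.0
    r = [math.log(rank) for rank in range(1, n + 1)]   # r_w = log(rank(w))
    f = [math.log(freq) for freq in freqs]             # f_w = log(freq(w))
    sum_r, sum_f = sum(r), sum(f)
    sum_rf = sum(ri * fi for ri, fi in zip(r, f))
    sum_rr = sum(ri * ri for ri in r)
    denom = n * sum_rr - sum_r ** 2
    if denom == 0:
        return 0.0
    # The regression slope is -s, so negate it, as in formula (*).
    return -(n * sum_rf - sum_r * sum_f) / denom
```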
3.2.3 Sentence Structure Diversity

Most content spam generation techniques produce new unique texts from a set of natural samples. Spammers may use a Markovian text generator trained on a set of natural documents, or they may simply take sentences from different texts to form the content of a single page. These techniques often yield locally coherent texts that are hard to detect. To fight these types of spam we developed a set of features to measure the diversity of styles used in a text.

We elaborated the POS features described in Section 3.1 by adding a wide range of linguistic diversity features to detect style anomalies in texts. For each of the 145 POS ratio features we calculated its variance across the sentences of a text. The distribution of the variances of the adjective ratio is shown in Figure 2; similar distributions hold for other parts of speech and different grammatical categories. The graph confirms our hypothesis that content spam tends to mix styles from different texts, resulting in higher variances.

Figure 2. Prevalence of spam relative to adjective ratio variance across sentences (bars: number of hosts; line: spam ratio).

3.3 Topical Analysis

Web spam has a tendency to belong to several popular topics, like insurance or pornography. We used topical features for two purposes. Firstly, we used Latent Dirichlet Allocation (LDA) to measure the weights of different topics in texts and used these weights as classification features. Secondly, we analyzed the frequency rank distributions of these weights in order to detect topical structure anomalies.

3.3.1 LDA

We decided to implement a set of topical classification features using Latent Dirichlet Allocation [5]. LDA is a fully generative probabilistic model for texts. LDA assumes that each document is generated by a mixture of topics. The weights of these topics can be used for topical classification. Most importantly, the LDA weights were used to measure the topical diversity of texts. LDA-based topical diversity features are described in Section 3.3.2.

LDA has well-established parameter estimation and inference procedures based on Markov chain Monte Carlo methods [2]. We used the GibbsLDA++ library [18], which implements the Gibbs sampling algorithm for inference and parameter estimation. We trained the LDA model on 20K random documents from the WEBSPAM-UK2007 dataset, using 100 topics and α = 0.5, β = 0.01 for the hyper-parameters.

We could have used tf.idf for topical classification, but LDA also served as a dimensionality reduction algorithm. As a result we mapped every document into a 100-dimensional topic space instead of using a high-dimensional term vector space. The weights of the different topics served as features in classification.
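The paper uses GibbsLDA++ for inference. Purely as an illustration of the same pipeline, the sketch below trains a 100-topic model with the gensim library, which is our substitution rather than the tool used by the authors, and extracts a dense topic-weight vector per document; the priors 0.5 and 0.01 mirror the α and β values given above, assuming gensim's alpha/eta correspond to them.

```python
from gensim import corpora, models

def topic_weight_vectors(tokenized_docs, num_topics=100):
    """Map each document to a dense vector of LDA topic weights.
    gensim is used here only as a stand-in for GibbsLDA++."""
    dictionary = corpora.Dictionary(tokenized_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = models.LdaModel(
        corpus=bow_corpus, id2word=dictionary, num_topics=num_topics,
        alpha=0.5, eta=0.01, passes=10)        # priors as in Section 3.3.1
    vectors = []
    for bow in bow_corpus:
        weights = [0.0] * num_topics
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            weights[topic_id] = prob
        vectors.append(weights)
    return lda, vectors
```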
3.3.2 Topical Diversity

The analysis of LDA topic weights showed that these weights also follow a power law distribution. Figure 5 shows the weight distributions for several samples of spam and non-spam hosts. Topical distributions are correlated with term frequency distributions, but have an advantage over them: LDA accounts for correlated terms, so a single LDA topic usually covers a whole set of terms that often co-occur. This ensures that synonyms and similar terms are counted together and leaves spammers fewer chances to affect the feature.

Figure 5. Topical frequency distributions for different types of spam (topic frequency vs. topic frequency rank; spam: www.harrogate-toy-xmas-fair.co.uk and www.sherwoodguesthouseedinburgh.co.uk; ham: www.silverlight.co.uk).

For each document we estimated the uniformity of the frequency rank distribution of the LDA weights using formula (*), with topic frequencies instead of word frequencies. The prevalence of spam for different levels of topical uniformity is shown in Figure 4. The probability of spam is greater for hosts with both high and low uniformity. These two zones account for different types of content spam.

Figure 4. Prevalence of spam relative to topical uniformity (bars: number of hosts; line: spam ratio).

Hosts with higher uniformity usually contain texts stuffed with keywords (e.g. www.sherwoodguesthouseedinburgh.co.uk, Figure 3b). The other group of spam hosts has very low topical uniformity. Texts from this group usually contain search results or sentences taken from multiple other texts (e.g. www.harrogate-toy-xmas-fair.co.uk, Figure 3a). Topical distributions for these hosts are provided in Figure 5.

Figure 3a. Sample spam page with low topical uniformity (http://www.harrogate-toy-xmas-fair.co.uk/). The page consists of excerpts from different sources.

Figure 3b. Sample spam page with high topical uniformity (http://www.sherwoodguesthouseedinburgh.co.uk/). Notice the keywords at the top and side of the page and the highlighted keywords in the text.

We also researched an alternative approach to measuring topical diversity. Being a probabilistic model, LDA only produces the most probable topic weight distribution for a text. In order to detect spam content we calculated the probability of a document having a uniform topical distribution (all topics having the same weight). Treating this as a statistical hypothesis, Pearson's chi-squared statistic can be used to test it. Let N be the number of topics; then

  χ² = N · Σ_{topic=1..N} (weight_topic − 1/N)².

We used this statistic as a classification feature. The prevalence of spam depending on the χ² score is provided in Figure 6. A higher χ² score means that the host has a lower probability of having a uniform topical distribution. The spam probability for hosts with a χ² score greater than 0.1 is substantially higher than the average spam probability.

Figure 6. Prevalence of spam relative to the chi-squared topical score (bars: number of hosts; line: spam ratio; horizontal axis: χ² score for LDA weights).
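Both topical diversity scores of this subsection reduce to a few lines once the topic weights are available. The sketch below is an illustration under the same assumptions as the earlier sketches: it computes the χ² statistic against the uniform distribution and re-fits the log-weight versus log-rank regression of formula (*) directly on the ranked topic weights.

```python
import math

def chi_squared_topic_score(topic_weights):
    """Pearson chi-squared score of the LDA topic weights against a uniform
    distribution: N * sum((weight - 1/N)^2). Higher = less uniform topics."""
    n = len(topic_weights)
    return n * sum((w - 1.0 / n) ** 2 for w in topic_weights)

def topical_uniformity(topic_weights):
    """Formula (*) applied to topic weights ranked by magnitude."""
    ranked = sorted((w for w in topic_weights if w > 0), reverse=True)
    n = len(ranked)
    if n < 2:
        return 0.0
    r = [math.log(rank) for rank in range(1, n + 1)]    # log rank
    f = [math.log(w) for w in ranked]                   # log weight
    denom = n * sum(ri * ri for ri in r) - sum(r) ** 2
    if denom == 0:
        return 0.0
    return -(n * sum(ri * fi for ri, fi in zip(r, f)) - sum(r) * sum(f)) / denom
```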
3.4 Machine Learning

Using LDA as a dimensionality reduction algorithm allowed us to use algorithms designed for dense data, without implementing complex ensembles of classifiers. We used logistic regression with L2 regularization and a fixed regularization parameter value of 0.25. This yields a relatively simple linear classifier with regression coefficients that can be interpreted as the contribution of each feature to the classification task. Some features, such as topical uniformity, show non-linear behavior that cannot be accounted for using a linear classification formula.
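The description above maps onto a standard L2-regularized logistic regression. The sketch below uses scikit-learn as an illustration; mapping the paper's "regularization parameter value of 0.25" onto scikit-learn's C parameter is our assumption, since the paper does not state which parameterization it uses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_spam_classifier(feature_matrix, labels):
    """L2-regularized logistic regression over the dense feature vectors
    (statistical + diversity + LDA topic weights)."""
    clf = LogisticRegression(penalty="l2", C=0.25, max_iter=1000)
    clf.fit(np.asarray(feature_matrix), np.asarray(labels))
    return clf

# The learned coefficients can then be read off as feature contributions:
# for name, weight in zip(feature_names, clf.coef_[0]): print(name, weight)
```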
3.5 Complexity Estimation

To prove that the proposed algorithm can be used in web-scale spam detection tasks we also estimated the complexity of the proposed algorithm during the classification phase. The algorithm can be loosely split into three parts:
- Statistical features calculation;
- Topical diversity estimation based on LDA;
- Machine learning.

The first phase includes POS tagging and compressibility analysis. We used simple POS taggers that analyze single words and do not take previous words into account. The complexity of the POS tagging process is on the order of the document's length, O(|d|). The first phase also includes the term-level diversity calculation, which requires sorting words by their frequencies, so the complexity of the diversity calculation is on the order of O(|d| log |d|).

The second part of the algorithm starts with LDA inference. Gibbs sampling is used for inference, and the complexity of each iteration is proportional to the length of the document and the number of topics used [18]. Instead of running Gibbs sampling until convergence we used a fixed number of iterations that suited our purposes well, so the complexity of the Gibbs sampling phase was O(|d|).

The calculation of topical diversity after the topic weights have been estimated depends only on the number of topics, and its complexity can be assumed constant. Finally, in the machine learning phase we used a constant number of features in a linear classification formula, so its complexity is also constant. Overall, the complexity of the proposed classification algorithm is O(|d| log |d|), where |d| is the length of the classified document.

4 Experiments

The evaluation of the proposed framework consisted of three experiments. First we tested the ability of our approach to detect synthetic automatically generated texts. The second experiment was dedicated to measuring the benefit of the proposed features. Finally we tested the framework in the Web Spam Challenge [20] settings.

4.1 Synthetic Text Experiment

First we tested the capability of the described features to detect automatically generated low quality texts. We created a set of synthetic texts using a Markovian text generator. The generator was trained on a collection of 20K random documents from the WEBSPAM-UK2007 dataset. Here is a sample of such synthetic text generated from this article:

  Tf.idf and other term-weighting approaches are often used by web spammers to generate thousands of doorway pages, each optimized for a specific query, to maximize amount of text, and ratio of verbs in past tense compared to all verbs: We used POS tagger to tag every word in the dataset.

Such texts consist of locally coherent pieces collected from other documents. We used 10K synthetic documents and 10K random documents from the WEBSPAM-UK2007 dataset as a training set. The test set for the experiment was created in a similar fashion. We used two Markov chains of order 2 (MC2) and 3 (MC3) to measure the effect of this parameter on classification.

In order to measure the effect of the proposed features we made two runs of the experiment. First we used only statistical features and LDA weights as a baseline experiment (SF+LDA). During the second run we used all available features including the diversity features (All).

Table 2 contains the results of the experiment. The high F-measure suggests that the described features are adequate for detecting such advanced types of content spam. An increase in the Markov chain order causes the generator to repeat larger pieces of the original documents. This reduces the detection rate, but increases the amount of non-unique content in such texts. The results also show that the proposed diversity features substantially improve the classifier. In fact, they reduce the number of false positives and false negatives by half.

Table 2. Precision, Recall, and F-measure for the synthetic text detection experiment

               Precision   Recall    F-measure
  MC2, SF+LDA  96.19%      96.11%    96.15%
  MC3, SF+LDA  94.08%      92.29%    93.18%
  MC2, All     98.37%      97.93%    98.14%
  MC3, All     97.72%      97.09%    97.40%

4.2 Feature Analysis

The purpose of the second experiment was to estimate the power of each of the 334 features. The settings of this experiment were similar to the synthetic text detection experiment. We used 20K documents from the WEBSPAM-UK2007 dataset as a non-spam sample and generated 20K documents using a Markov chain text generator with a chain length of 2. These sets were then split evenly into training and testing datasets.

For each feature we trained a separate classifier. Each classifier was trained on a single feature. The classification F-measure of the given classifier can be viewed as a measure of the usefulness of the corresponding feature (a sketch of this procedure is given at the end of this subsection). Table 1 contains the 20 most useful features for the synthetic text classification task.

Table 1. Feature strength analysis

  Feature                                                        F-measure   Feature type
  Topical uniformity                                             91.23%      Diversity
  Gzip compression rate                                          89.70%      Diversity
  χ² score for LDA weights                                       87.03%      Diversity
  bz2 compression rate                                           85.04%      Diversity
  Term uniformity                                                81.28%      Diversity
  Average number of words repeated in neighbor sentences         79.60%      Diversity
  Verbs in past tense ratio                                      74.49%      Statistical
  Average number of expressive punctuation marks per sentence    73.54%      Statistical
  Verbs in past tense variance                                   73.34%      Diversity
  Modal verbs variance                                           72.88%      Diversity
  Fraction of sentences with several verbs                       71.27%      Statistical
  Personal pronouns ratio                                        71.13%      Statistical
  Proper nouns ratio                                             71.06%      Statistical
  Possessive endings variance                                    70.66%      Diversity
  Words with one syllable ratio                                  70.63%      Statistical
  Modal verbs ratio                                              70.59%      Statistical
  Words with two syllables ratio                                 70.56%      Statistical
  Cardinal numbers variance                                      70.55%      Diversity
  Cardinal numbers ratio                                         70.06%      Statistical
  Determiners ratio                                              69.82%      Statistical

The results of the experiment show that diversity features are paramount for detecting Markov chain generated texts. The proposed topical diversity features score best on this metric, along with text compressibility. Other diversity features can also be seen among the top 20.
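The per-feature evaluation referenced above amounts to training a one-dimensional classifier per feature column and ranking features by their test F-measure. A minimal sketch, again with scikit-learn as an assumed tool and NumPy arrays assumed for the feature matrices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def rank_features(X_train, y_train, X_test, y_test, feature_names):
    """Train one single-feature classifier per column and rank by F-measure.
    X_train/X_test: 2-D NumPy arrays (documents x features)."""
    scores = []
    for j, name in enumerate(feature_names):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_train[:, [j]], y_train)
        predictions = clf.predict(X_test[:, [j]])
        scores.append((name, f1_score(y_test, predictions)))
    return sorted(scores, key=lambda item: item[1], reverse=True)
```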
4.3 WEBSPAM-UK2007 Experiment

In this experiment we followed the evaluation protocol of the Web Spam Challenge [20]. Using this evaluation procedure we could compare our results with other studies. The Web Spam Challenge 2008 was held on the WEBSPAM-UK2007 dataset [22]. The training and testing labels are also defined in the dataset. The official quality measure for the challenge was the Area under the ROC Curve (AUC ROC). We also calculated the optimal F-measure for the classification task.

We compared against the best results on this dataset. The winners of the 2008 Web Spam Challenge, Geng et al. [12], used pre-computed features and advanced bagging strategies to reach an AUC of 0.85. Biro et al. [4] used the linked LDA model to combine link and content features, yielding an AUC score of 0.854. Dai et al. [8] used temporal features and achieved a classification F-measure of 0.521.

We combined the features into the following groups:
- SF – statistical features (Section 3.1);
- DF – various text diversity features (Sections 3.2 and 3.3.2);
- LDA – the Latent Dirichlet Allocation topic weights (Section 3.3.1).

The results of classification using various groups of features and machine learning algorithms are provided in Table 3. Using logistic regression, the best result of 0.871 AUC is achieved when combining all features. Our approach substantially improves over the nearest result of 0.854 AUC. The results show that topical classification features (LDA) are still crucial to web spam detection, but statistical features (SF) and diversity features (DF) improve the results substantially.

Table 3. Results for the WEBSPAM-UK2007 experiment

  Features           AUC     F1
  Geng et al.        0.85    --
  Biro et al.        0.854   --
  SF                 0.746   0.284
  DF                 0.744   0.323
  LDA                0.845   0.442
  SF+DF              0.777   0.348
  SF+LDA             0.867   0.433
  DF+LDA             0.864   0.448
  All (SF+DF+LDA)    0.871   0.458

  SF – statistical features; DF – diversity features; LDA – Latent Dirichlet Allocation topic weights.
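For completeness, the two evaluation measures used in this experiment, AUC ROC and the optimal F-measure over all decision thresholds, can be computed from the classifier's spam scores as in the short sketch below; scikit-learn is our tooling assumption, not part of the original experimental setup.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve

def challenge_metrics(y_true, spam_scores):
    """AUC ROC plus the best F-measure over all decision thresholds."""
    auc = roc_auc_score(y_true, spam_scores)
    precision, recall, _ = precision_recall_curve(y_true, spam_scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return auc, float(f1.max())
```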
5 Conclusion and Future Work

The results of our research show that advanced content features are useful for content spam detection. We analyzed different aspects of natural texts and produced a set of features to cover as many aspects as possible. The resulting spam classifier performed well on both synthetic and real-life tasks.

The proposed approach is based solely on content analysis and does not take link data into account. Combining the proposed method with existing link-spam detection techniques is likely to improve results. Another possible extension is to apply the diversity measures and rank distributions to link data in order to detect unnatural link structures.

Web spam is primarily an economic phenomenon, and the amount of spam depends on the efficiency and costs of different spam generation techniques. We hope that the multiple diversity features described in this work can substantially decrease the efficiency of automatically generated content spam. There are many properties of natural texts that are not covered by this article. We plan to continue research on various aspects of natural texts that are hard to reproduce.

References

[1] J. Abernethy, O. Chapelle, and C. Castillo. WITCH: A New Approach to Web Spam Detection. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.
[2] C. Andrieu, N. de Freitas, A. Doucet, and M. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5–43, 2003.
[3] I. Biro, J. Szabo, and A. A. Benczur. Latent Dirichlet allocation in web spam filtering. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), April 22, 2008, Beijing, China.
[4] I. Biro, D. Siklosi, J. Szabo, and A. A. Benczur. Linked latent Dirichlet allocation in web spam filtering. In Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), April 21, 2009, Madrid, Spain.
[5] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(5):993–1022, 2003.
[6] A. Bratko, G. V. Cormack, B. Filipič, T. R. Lynam, and B. Zupan. Spam filtering using statistical data compression models. Journal of Machine Learning Research, 7(Dec):2673–2698, 2006.
[7] C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini, and S. Vigna. A reference collection for web spam. ACM SIGIR Forum, 40(2):11–24, December 2006.
[8] N. Dai, B. D. Davison, and X. Qi. Looking into the past to better classify web spam. In Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2009.
[9] H. Dang. Overview of DUC 2006. In Proceedings of the Document Understanding Conference, 2006.
[10] D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the world wide web. In Proceedings of the 28th ACM International Conference on Research and Development in Information Retrieval (SIGIR), Salvador, Brazil, 2005.
[11] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics – Using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB), pages 1–6, Paris, France, 2004.
[12] G. Geng, X. Jin, and C.-H. Wang. CASIA at Web Spam Challenge 2008 Track III. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.
[13] A. Gulin and P. Karpovich. Greedy Function Optimization in Learning to Rank, 2009. Available at: http://romip.ru/russir2009/slides/yandex/lecture.pdf.
[14] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating Web Spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), August 2004.
[15] Z. Gyongyi and H. Garcia-Molina. Web Spam Taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), May 2005.
[16] M. Henzinger, R. Motwani, and C. Silverstein. Challenges in Web Search Engines. SIGIR Forum, 36(2), 2002.
[17] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International Conference on World Wide Web (WWW), May 23–26, 2006, Edinburgh, Scotland.
[18] X.-H. Phan and C.-T. Nguyen. GibbsLDA++: A C/C++ Implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling for Parameter Estimation and Inference. http://gibbslda.sourceforge.net/, 2008.
[19] J. Piskorski, M. Sydow, and D. Weiss. Exploring linguistic features for web spam detection: a preliminary study. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), April 22, 2008, Beijing, China.
[20] Web Spam Challenge. http://webspam.lip6.fr/wiki/pmwiki.php, 2008.
[21] B. Wu and B. D. Davison. Identifying link farm spam pages. In Special interest tracks and posters of the 14th International Conference on World Wide Web (WWW), 2005.
[22] Yahoo! Research: "Web Spam Collections". http://barcelona.research.yahoo.net/webspam/datasets/. Crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.dsi.unimi.it/. URLs retrieved May 2007.
[23] G. Zipf. Selective Studies and the Principle of Relative Frequency in Language. Cambridge, Mass., 1932.