Wikipedia Vandalism Detection Through Machine Learning: Feature Review and New Proposals∗

Lab Report for PAN at CLEF 2010

Santiago M. Mola Velasco
Private
coldwind@coldwind.org

Abstract

Wikipedia is an online encyclopedia that anyone can edit. In this open model, some people edit with the intent of harming the integrity of Wikipedia. This is known as vandalism. We extend the framework presented in (Potthast, Stein, and Gerling, 2008) for Wikipedia vandalism detection. In this approach, several vandalism-indicating features are extracted from edits in a vandalism corpus and fed to a supervised learning algorithm. The best performing classifiers were LogitBoost and Random Forest. Our classifier, a Random Forest, obtained an AUC of 0.92236, ranking in first place in the PAN'10 Wikipedia vandalism detection task.

1 Introduction

Wikipedia is an online encyclopedia built upon the collaboration of thousands of editors. Its collaboration model is simple: anyone can edit any article at any time. This has made possible the great success of Wikipedia, but it comes with its own problems, one of them being destructive edits. There are many ways in which an edit can be destructive for Wikipedia, such as lobbying, spam, vandalism, tests, etc. In PAN 2010 Lab's Task 2 we focus on automatic detection of vandalism. The English Wikipedia defines vandalism as:

[...] any addition, removal, or change of content made in a deliberate attempt to compromise the integrity of Wikipedia. [...] Common types of vandalism are the addition of obscenities or crude humor, page blanking, and the insertion of nonsense into articles. Any good-faith effort to improve the encyclopedia, even if misguided or ill-considered, is not vandalism. Even harmful edits that are not explicitly made in bad faith are not vandalism.
(Wikipedia contributors, 2010a)

The Wikipedia community maintains a project for studying vandalism (Wikipedia contributors, 2010b) and develops bots for automatic detection and reversion. Currently, the most prominent active bot in the English Wikipedia is ClueBot, a rule-based system (Carter, 2010). ClueBot's rules rely on the length increment of the edit and a scoring system based on regular expressions checking for vulgarisms, grammar, etc. This system achieves a precision near 1 but a very low recall. Similar systems can be found in other Wikipedias, such as AVBOT (Rodríguez Posada, 2010) in the Spanish edition.

(Potthast, Stein, and Gerling, 2008) defined Wikipedia vandalism detection as a classification task, proposed a set of discriminating features and provided an evaluation of them based on machine learning. Our work is founded on that definition, extending the set of features proposed in their work. We built a system trying to reproduce previous results, and afterwards extended it with a new range of features and tuning of previous ones. We used the PAN-WVC-10 corpus (Potthast, 2010). The training set of this corpus comprises 15000 annotated edits, 924 of which are labeled as vandalism. Our goal is to build a vandalism classifier using this corpus.

The rest of the paper is structured as follows. In Section 2 we describe the preprocessing applied to the data. In Section 3 we discuss all edit features that we used. In Section 4 we present the classifiers we used. In Section 5 we evaluate our features and classifiers and present the classifier used for the PAN'10 result submission. Finally, we present our conclusions in Section 6.

∗ Thanks to the PAN'10 organizers, to the reviewers, to Dmitry Chichkov for his contributions to the PAN'10 group and to Sandra García Blasco for her support and ideas.
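The regular-expression scoring used by bots such as ClueBot can be sketched as follows. The patterns and weights here are purely illustrative, not ClueBot's actual rule set:

```python
import re

# Illustrative (pattern, weight) pairs -- NOT ClueBot's actual rules.
SCORED_PATTERNS = [
    (re.compile(r"\b(fuck|suck|stupid)\b", re.IGNORECASE), 5),  # vulgarisms
    (re.compile(r"(.)\1{4,}"), 3),       # runs of five or more identical chars
    (re.compile(r"\b[A-Z]{10,}\b"), 2),  # long all-caps "shouting"
]

def vandalism_score(inserted_text):
    """Sum the weights of all pattern matches in the inserted text."""
    return sum(weight * len(pattern.findall(inserted_text))
               for pattern, weight in SCORED_PATTERNS)
```

An edit whose score exceeds some threshold would then be flagged for reversion, which explains the high-precision, low-recall behaviour: only edits matching a hand-written rule are ever caught.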
2 Preprocessing

In our feature framework, we defined different representations of edits for use in different features. In this section we describe these representations and the steps to construct them.

First, we defined a tokenization function. A token can be a word, a number, a punctuation mark or a wiki-syntax element. When describing features, we will consider a word to be any of these tokens. The following symbols are considered independent tokens: ., ,, :, ;, ", «, », ’, |, ?, !, =, (, ), *, [[, ]], [, ], {{, }}, { and }.

We also need to produce diffs between the old and new revisions of an edit. To that end, we used google-diff-match-patch.1 Using these functions, we define the following representations of edit text:

– Old text and new text. The old and new text without any preprocessing.
– Case-sensitive inserted words. The set of inserted words.
– Inserted words. The set of inserted words, converted to lowercase.
– Concatenated inserted words. All inserted words concatenated and separated with spaces. This is defined both case-sensitively and case-insensitively.
– Inserted text. Inserted lines as reported by the diff algorithm.

1 Available at google-diff-match-patch. The behaviour of different parameters of the diff function has not been studied yet. The parameters used were Patch_Margin set to 0, Diff_EditCost to 6 and Match_Distance to 500.

3 Edit Features

In this section, we describe our set of features. Features based on dictionaries will be explained separately. Features marked with ∗ were already defined in (Potthast, Stein, and Gerling, 2008) and those marked with † are modifications of features also defined in that work.

Anonymous∗ Whether the editor is anonymous or not.
Vandals are likely to be anonymous. This feature is used in one way or another in most working antivandalism bots such as ClueBot and AVBOT. In the PAN-WVC-10 training set (Potthast, 2010) anonymous edits represent 29% of the regular edits and 87% of vandalism edits.
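Stepping back to the preprocessing stage, the tokenizer and the inserted-words representations of Section 2 can be sketched as follows. The regular expression is an illustrative approximation of the symbol list above, and the set difference stands in for a true diff (the actual system uses google-diff-match-patch):

```python
import re

# Wiki-syntax symbols treated as independent tokens (from Section 2).
# Multi-character symbols come first so they win over '[' and '{'.
SYMBOLS = ['[[', ']]', '{{', '}}', '.', ',', ':', ';', '"', '«', '»',
           "'", '|', '?', '!', '=', '(', ')', '*', '[', ']', '{', '}']
_TOKEN_RE = re.compile('|'.join(re.escape(s) for s in SYMBOLS) + r'|\w+')

def tokenize(text):
    """Split text into words, numbers and wiki-syntax tokens."""
    return _TOKEN_RE.findall(text)

def inserted_words(old_text, new_text, lowercase=False):
    """Approximate the 'inserted words' set as the tokens of the new
    revision that do not occur in the old one."""
    inserted = set(tokenize(new_text)) - set(tokenize(old_text))
    return {w.lower() for w in inserted} if lowercase else inserted
```

For example, `tokenize("Hello, [[world]]!")` yields `['Hello', ',', '[[', 'world', ']]', '!']`, keeping the wiki-link brackets as tokens of their own.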
Comment length∗ Length in characters of the edit summary.
Long comments might indicate regular editing and short or blank ones might suggest vandalism. However, this feature is quite weak, since leaving an empty comment in regular editing is a common practice.

Upper to lower ratio† Uppercase to lowercase letters ratio, i.e., (1 + |upper|) / (1 + |lower|).
Vandals often do not follow capitalization rules, writing everything in lowercase or in uppercase.

Upper to all ratio† Uppercase letters to all letters ratio, i.e., (1 + |upper|) / (1 + |lower| + |upper|).

Digit ratio Digit to all characters ratio, i.e., (1 + |digit|) / (1 + |all|).
This feature helps to spot minor edits that only change numbers, which might help to find some cases of subtle vandalism where the vandal arbitrarily changes a date or a number to introduce misinformation.

Non-alphanumeric ratio Non-alphanumeric to all characters ratio, i.e., (1 + |nonalphanumeric|) / (1 + |all|).
An excess of non-alphanumeric characters in short texts might indicate excessive use of exclamation marks or emoticons.

Character diversity Measure of different characters compared to the length of inserted text, given by the expression length^(1/different chars).
This feature helps to spot random keyboard hits and other nonsense. It should take into account the QWERTY keyboard layout in the future.

Character distribution† Kullback-Leibler divergence of the character distribution of the inserted text with respect to the expected distribution. Useful to detect nonsense.

Compressibility† Compression rate of inserted text using the LZW algorithm.2
Useful to detect nonsense, repetitions of the same character or words, etc.

Size increment Absolute increment of size, i.e., |new| − |old|.
The value of this feature is already well-established. ClueBot uses various thresholds of size increment for its heuristics, e.g., a big size decrement is considered an indicator of blanking.

Size ratio∗ Size of the new revision relative to the old revision, i.e., (1 + |new|) / (1 + |old|).
Complements size increment.

2 LZW was chosen after evaluating the behaviour of LZW, gzip and bzip2, although an exhaustive comparison is still pending. We used the TIFF LZW algorithm (Adobe Developers Association, 1992) as implemented in python-lzw 0.01 by Joe Bowers, available at http://www.joe-bowers.com/static/lzw/.

Average term frequency∗ Average relative frequency of inserted words in the new revision.
In long and well-established articles, many inserted words that do not appear in the rest of the article indicate that the edit might be introducing nonsense or unrelated content.

Longest word∗ Length of the longest word in inserted text. Useful to detect nonsense.

Longest character sequence∗ Longest consecutive sequence of the same character in the inserted text.
Long sequences of the same character are frequent in vandalism (e.g. aaggggghhhhhhh!!!!!, soooooo huge).

Size increment and size ratio are closely related to all other features, as can be seen in the ClueBot and AVBOT source code, where thresholds for these measures are coded deep into most heuristics. This should be considered when building a classifier.

Upper to all, upper to lower, digit and non-alphanumeric ratios were computed using the case-sensitive concatenated inserted words. These could also be computed using a diff, or by computing the difference of their values in the old and new text. Each method produces different results. We chose only inserted words for efficiency.

Antivandalism bots use lists of regular expressions with weights in order to compute a vandalism score. In (Potthast, Stein, and Gerling, 2008) a similar approach is described, where vulgarism frequency and impact are considered. Vulgarism frequency is the frequency of vulgarisms in an edit text relative to all words in the edit. Vulgarism impact is the percentage by which an edit increases the number of vulgarisms in the text. In the same way, pronoun frequency and impact were defined.
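These two measures can be sketched as follows for any word category. The word set is a tiny illustrative sample, and the +1 smoothing in the impact denominator is our assumption for handling texts with no category words, not necessarily the exact formulation of (Potthast, Stein, and Gerling, 2008):

```python
VULGARISMS = {'fuck', 'suck', 'stupid'}  # tiny illustrative sample

def category_frequency(edit_words, category=VULGARISMS):
    """Frequency of category words relative to all words in the edit."""
    if not edit_words:
        return 0.0
    return sum(w.lower() in category for w in edit_words) / len(edit_words)

def category_impact(old_words, new_words, category=VULGARISMS):
    """Relative increase in category words from the old revision to the
    new one (the +1 smoothing handles texts with no such words)."""
    old_hits = sum(w.lower() in category for w in old_words)
    new_hits = sum(w.lower() in category for w in new_words)
    return (new_hits - old_hits) / (1 + old_hits)
```

The same two functions, applied with a different word set, yield the frequency and impact features for each category defined below.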
We have defined features analogous to vulgarism frequency and vulgarism impact for different categories of words:

Vulgarisms Vulgar and offensive words, e.g., fuck, suck, stupid.
Pronouns First and second person pronouns, including slang spellings, e.g., I, you, ya.
Biased Colloquial words with high bias, e.g., coolest, huge.
Sex Non-vulgar sex-related words, e.g., sex, penis, nipple.
Bad Hodgepodge category for colloquial contractions (e.g. wanna, gotcha), typos (e.g. dosent), etc.
All A meta-category, containing vulgarisms, pronouns, biased, sex-related and bad words.
Good Words rarely used by vandals, mainly wiki-syntax elements (e.g. __TOC__).

Word counters were computed using the inserted words in lowercase. In future work, it would be desirable to consider letter case and style. For example, dick is very likely to be slang for penis and DICK is a strong indicator of vandalism, but Dick is more likely to be the diminutive of Richard or a surname. Our current system does not make this distinction.

4 Classification

Classification was conducted using the Weka framework (Hall et al., 2009). After preliminary evaluations, we tried to choose classifiers which fulfil all or most of the following conditions: a) require little or no preprocessing of data, b) require little parameter adjustment, c) do implicit feature selection, d) are resistant to noise and outliers and e) are resistant to severe class imbalance.

Our baseline classifier is the C4.5 decision tree (Quinlan, 1993), which is a well-established algorithm and, to some extent, fulfils our criteria. LogitBoost (Friedman, Hastie, and Tibshirani, 2000) and Random Forest (Breiman, 2001) are attractive because of their implicit feature selection, generalization properties and low number of parameters. We expect Random Forest to properly exploit relations between features such as those described in Section 3.
Random Forest has another interesting property: it uses an internal test error estimation known as out-of-bag error, which makes cross-validation unnecessary.

We also chose Support Vector Machines (SVM) because of their resistance to class imbalance (Japkowicz and Stephen, 2002). However, SVM was discarded for inclusion in the PAN'10 results because of its high computational demands and the number of parameters to adjust.

5 Evaluation

The following measures have been evaluated in every experiment: Precision, Recall, F-Measure and Area Under the ROC Curve (AUC).

In Table 1 we can see the performance of a C4.5 decision tree built for each individual feature. The C4.5 algorithm, considering only one feature and with severe class imbalance, will often produce a majority-class classifier. In order to prevent this, we changed the weight of vandalism edits so that the cost of a misclassified vandalism edit is 10 times higher than that of a misclassified regular edit. This is meant to be indicative, not an accurate ranking of features.

Table 1. Evaluation of individual features with a C4.5 decision tree.

Feature                  Precision  Recall  F-Measure  AUC
Anonymous                0.166      0.872   0.278      0.78
Comment length           0.102      0.663   0.177      0.661
Upper to lower ratio     0.131      0.629   0.217      0.755
Upper to all ratio       0.129      0.621   0.214      0.731
Digit ratio              0.128      0.718   0.217      0.75
Non-alphanumeric ratio   0.117      0.629   0.197      0.729
Character diversity      0.121      0.689   0.206      0.741
Character distribution   0.11       0.698   0.19       0.71
Compressibility          0.095      0.29    0.143      0.627
Size increment           0.111      0.669   0.191      0.69
Size ratio               0.1        0.195   0.132      0.572
Average term frequency   0.207      0.021   0.037      0.503
Longest word             0.124      0.712   0.211      0.742
Longest character seq.   0.115      0.514   0.189      0.628
All frequency            0.762      0.353   0.482      0.661
All impact               0.501      0.377   0.43       0.662
Vulgarism frequency      0.904      0.214   0.346      0.597
Vulgarism impact         0.618      0.227   0.332      0.599
Bad word frequency       0.747      0.08    0.145      0.532
Bad word impact          0.348      0.111   0.169      0.539
Biased frequency         0.623      0.082   0.145      0.536
Biased impact            0.311      0.173   0.223      0.561
Sex frequency            0.717      0.041   0.078      0.517
Sex impact               0.537      0.048   0.087      0.518
Pronoun frequency        0.588      0.152   0.241      0.562
Pronoun impact           0.311      0.173   0.223      0.561
Goodword frequency       0          0       0          0.545
Goodword impact          0.149      0.106   0.124      0.584

In Table 2 we can see the performance of all the classifiers considering all features. Every classifier has been evaluated with 10-fold cross-validation. Parameters used for the classifiers were:3

C4.5 Minimum number of instances per leaf is 6; Laplace smoothing is used for predicted probabilities.
LogitBoost 100, 500 and 1000 boost iterations. The weak classifier is a Decision Stump.
Random Forest 100, 500 and 1000 iterations.

3 Any omitted parameter uses the default value in Weka. For more information, check the Weka API documentation at http://weka.sourceforge.net/doc.stable/.

Table 2. Classifier evaluation.

Classifier               Precision  Recall  F-Measure  AUC
C4.5                     0.773      0.543   0.638      0.93
Random Forest 100 it.    0.855      0.568   0.683      0.957
Random Forest 500 it.    0.86       0.565   0.682      0.962
Random Forest 1000 it.   0.861      0.568   0.684      0.963
LogitBoost 100 it.       0.845      0.566   0.678      0.966
LogitBoost 500 it.       0.82       0.606   0.697      0.964
LogitBoost 1000 it.      0.795      0.611   0.691      0.961

LogitBoost and Random Forest perform similarly in this task. Random Forest improves F-Measure and AUC as we increase the number of iterations. While LogitBoost shows slightly better performance than Random Forest, its results are less stable and decrease as the number of iterations increases, suggesting a problem of overfitting. This is, presumably, because of LogitBoost's sensitivity to noise and outliers.

Our final classifier was a Random Forest of 1000 trees, each one built considering 5 random features,4 obtaining an AUC of 0.92236 when evaluated with the PAN-WVC-10 testing corpus.
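The AUC values used throughout the evaluation can be computed from classifier scores without tracing an explicit ROC curve, via the rank-based (Mann-Whitney) formulation; a minimal sketch:

```python
def auc(vandalism_scores, regular_scores):
    """AUC as the probability that a randomly chosen vandalism edit
    receives a higher score than a randomly chosen regular edit
    (ties count as 0.5) -- the Mann-Whitney U formulation."""
    wins = sum((v > r) + 0.5 * (v == r)
               for v in vandalism_scores for r in regular_scores)
    return wins / (len(vandalism_scores) * len(regular_scores))
```

This pairwise form is O(nm); for large evaluation sets, sorting all scores and summing the ranks of the positive class gives the same value in O((n + m) log(n + m)).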
6 Conclusions

In this work, we revisited the approach of (Potthast, Stein, and Gerling, 2008) to Wikipedia vandalism detection, tuning some features and introducing others. We further explored features based on word lists, expanding them beyond vulgarisms and improving results by creating new categories. We performed a brief comparison of different supervised learning techniques for this task. Random Forest and LogitBoost produced good results. Our final classifier, a Random Forest, obtained an AUC of 0.92236, which ranked first in PAN'10 for the vandalism detection task. Precision and recall on the training set were 0.861 and 0.568, respectively. This is far better than current antivandalism systems in terms of F-Measure, but it neither achieves precision near 1, which is required for a bot to operate autonomously, nor a recall high enough to be used as a filtering system for human editors. In any case, our results suggest that this machine learning based approach is a suitable path for next-generation antivandalism bots.

Future work might include a more exhaustive feature evaluation; building bigger word lists using free lexical resources such as Wiktionary and WordNet and inferring their weights; introducing a feature for detecting random keyboard hits considering the QWERTY keyboard layout; introducing heuristics from current antivandalism bots; and experimenting with measures based on n-gram language models.

References

Adobe Developers Association, 1992. TIFF Revision 6.0: Final. Adobe Systems Incorporated, June.

Breiman, Leo. 2001. Random Forests. Machine Learning, 45(1):5–32.

Carter, Jacobi. 2010. ClueBot and Vandalism on Wikipedia. http://www.acm.uiuc.edu/~carter11/ClueBot.pdf.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2000. Additive Logistic Regression: a Statistical View of Boosting. Annals of Statistics, 28.

Hall, Mark, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009.
The WEKA Data Mining Software: an Update. SIGKDD Explorations, 11(1):10–18.

Japkowicz, Nathalie and Shaju Stephen. 2002. The Class Imbalance Problem: A Systematic Study. Intelligent Data Analysis, 6(5):429–449, February.

Potthast, Martin. 2010. Crowdsourcing a Wikipedia Vandalism Corpus. In 33rd Annual International ACM SIGIR Conference. ACM, July.

Potthast, Martin, Benno Stein, and Robert Gerling. 2008. Automatic Vandalism Detection in Wikipedia. In Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, and Ryen W. White, editors, Advances in Information Retrieval, volume 4956 of Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, chapter 75, pages 663–668.

Quinlan, J. Ross. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

Rodríguez Posada, Emilio José. 2010. AVBOT: detección y corrección de vandalismos en Wikipedia [AVBOT: detection and correction of vandalism in Wikipedia]. NovATIca, (203):51–53.

Wikipedia contributors. 2010a. Wikipedia:Vandalism. In Wikipedia, The Free Encyclopedia. http://en.wikipedia.org/w/index.php?title=Wikipedia:Vandalism&oldid=366579916.

Wikipedia contributors. 2010b. Wikipedia:WikiProject Vandalism studies. In Wikipedia, The Free Encyclopedia. http://en.wikipedia.org/w/index.php?title=Wikipedia:WikiProject_Vandalism_studies&oldid=365253014.

4 That is, log2 M + 1, where M is the number of features.