         Multilingual Vandalism Detection using
     Language-Independent & Ex Post Facto Evidence
                        Notebook for PAN at CLEF 2011

                               Andrew G. West and Insup Lee

                         Dept. of Computer and Information Science
                         University of Pennsylvania - Philadelphia, PA
                                {westand, lee}@cis.upenn.edu


       Abstract There is much literature on Wikipedia vandalism detection. However,
       this writing addresses two facets given little treatment to date. First, prior efforts
       emphasize zero-delay detection, classifying edits the moment they are made. If
       classification can be delayed (e.g., compiling offline distributions), it is possible
       to leverage ex post facto evidence. This work describes/evaluates several features
       of this type, which we find to be overwhelmingly strong vandalism indicators.
           Second, English Wikipedia has been the primary test-bed for research. Yet,
       Wikipedia has 200+ language editions and use of localized features impairs porta-
       bility. This work implements an extensive set of language-independent indicators
       and evaluates them using three corpora (German, English, Spanish). The work
       then extends to include language-specific signals. Quantifying their performance
       benefit, we find that such features can moderately increase classifier accuracy, but
       significant effort and language fluency are required to capture this utility.
           Aside from these novel aspects, this effort also broadly addresses the task,
       implementing 65 total features. Evaluation produces 0.840 PR-AUC on the zero-
       delay task and 0.906 PR-AUC with ex post facto evidence (averaging languages).
       Performance matches the state-of-the-art (English), sets novel baselines (German,
       Spanish), and is validated by a first-place finish over the 2011 PAN-CLEF test set.



1 Introduction
Unconstructive or ill-intentioned edits (i.e., vandalism) on Wikipedia erode the encyclo-
pedia’s reputation and waste the time of those who must locate and remove the damage.
Moreover, while Wikipedia is the focus of this work, these are issues that affect all wiki
environments and collaborative software [9]. Classifiers capable of detecting vandalism
can mitigate these issues by autonomously undoing poor edits or prioritizing human ef-
forts in locating them. Numerous proposals have addressed this need, which is well surveyed
in [2,6,9]. These techniques span multiple domains, including natural language pro-
cessing (NLP), reputation algorithms, and metadata analysis. Recently, our own prior
work [2] combined the leading approaches from these domains to establish a new per-
formance baseline; our technique herein borrows heavily from that effort.
    The 2011 edition of the PAN-CLEF vandalism detection competition, however, has
slightly redefined the task relative to the 2010 competition [6] and the bulk of existing
anti-vandalism research. In particular, two differences have motivated novel analysis
and feature development. First, the prior edition permitted only zero-delay features: an
edit simultaneously committed and evaluated at time t_n can only leverage information
from time t ≤ t_n. However, if evaluation can be delayed until time t_{n+m}, it is possi-
ble to use ex post facto evidence from the interval t_n < t ≤ t_{n+m} to aid predictive
efforts. While such features are not relevant for “gate-keeping,” they still have applica-
tions. For example, the presence of vandalism would severely undermine static content
distributions like the Wikipedia 1.0 project¹, which targets educational settings. This
work describes/evaluates several ex post facto features and finds them to be very strong
vandalism predictors.
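To make the timing constraint concrete, the following minimal sketch (our illustration only; the event representation is hypothetical) partitions timestamped evidence around an edit committed at time t_n:

    from datetime import datetime, timedelta

    def partition_evidence(events, t_n, delay):
        """Split timestamped evidence around an edit committed at t_n.
        Zero-delay features may use only events at t <= t_n; ex post facto
        features may additionally use the window t_n < t <= t_n + delay."""
        t_cutoff = t_n + delay
        zero_delay = [e for e in events if e["t"] <= t_n]
        ex_post_facto = [e for e in events if t_n < e["t"] <= t_cutoff]
        return zero_delay, ex_post_facto

    # e.g., permitting one hour of ex post facto evidence:
    # zd, epf = partition_evidence(events, datetime(2011, 5, 1, 12, 0),
    #                              timedelta(hours=1))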
    The second redefinition is that this year’s corpus contains edits from three lan-
guages: German, English, and Spanish. Prior research, however, has been conducted
almost exclusively in English, and the 2010 PAN-CLEF winning approach heavily uti-
lized English-specific dictionaries [6,8]. Such techniques do not lend themselves to
portability across Wikipedia’s 200+ language editions, motivating the use of language-
independent features. While these are capable of covering much of the problem space,
we find the addition of language-specific features still moderately improves classifier
performance. Orthogonal to the issue of portability, we also use the multiple corpora to
examine the consistency of feature performance across language versions.
    While discussion concentrates on these novel aspects, we also implement a breadth
of features (65 in total). Performance measures, as detailed in Sec. 3.2, vary based on
language and task. The complete feature set produces cross-validation results consistent
with the state-of-the-art for English [2] and establishes novel performance benchmarks
for Spanish and German (PR-AUC=0.91, weighting languages equally). Though perfor-
mance varied considerably over the label-withheld PAN-CLEF 2011 test set, our ap-
proach took first-place in the associated competition, reinforcing its status as the most
accurate known approach to vandalism classification.

2 Feature Set
This section describes the features implemented. Discussion begins with a core feature-
set that is both zero-delay and language independent (Sec. 2.1). Then, two extensions to
that set are handled: ex post facto (Sec. 2.2) and language-specific (Sec. 2.3). Any fea-
ture which cannot be calculated directly from the provided corpus utilizes the Wikipedia
API². Readers should consult cited works to learn about the algorithms and parameters
of complex features (i.e., reputations and lower-order classifiers).

2.1 Zero-Delay, Language-Independent Features
Tab. 1 presents features that are: (1) zero-delay and (2) language-independent. Note that
features utilizing standardized language localization are included in this category (e.g.,
“User Talk” in English is “Benutzer Diskussion” in German).
    Nearly all of these features have been described in prior work [2,6], so their discus-
sion is abbreviated here. Even so, these signals are fundamental to our overall approach,
given that a single implementation is portable across all language versions. This is pre-
cisely why an extensive quantity of these features has been encoded.
¹ http://en.wikipedia.org/wiki/Wikipedia:1.0
² http://en.wikipedia.org/w/api.php
            FEATURE        DESCRIPTION
       USR_IS_IP           Whether the editor is anonymous/IP, or a registered editor
      USR_IS_BOT           Whether the editor has the “bot” flag (i.e., non-human user)
         USR_AGE           Time, in seconds, since the editor’s first ever edit
  USR_BLK_BEFORE           Whether the editor has been blocked at any point in the past
     USR_PG_SIZE           Size, in bytes, of the editor’s “user talk” page
    USR_PG_WARNS           Quantity of vandalism warnings on editor’s “user talk” (EN only)
      USR_EDITS_*           Editor’s revisions in the last t, for t ∈ {hour, day, week, month, ever}
 USR_EDITS_DENSE           Normalizing USR_EDITS_EVER by USR_AGE
         USR_REP           Editor reputation capturing vandalism tendencies [10] (EN only)
 USR_COUNTRY_REP           Reputation for editor’s geo-located country of origin [10] (EN only)
      USR_HAS_RB           Whether the editor has ever been caught vandalizing [10] (EN only)
     USR_LAST_RB           Time, in seconds, since editor last vandalized [10] (EN only)
         ART_AGE           Time, in seconds, since the edited article was created
      ART_EDITS_*           Article revisions in the last t, for t ∈ {hour, day, week, month, ever}
 ART_EDITS_DENSE           Normalizing ART_EDITS_EVER by ART_AGE
        ART_SIZE           Size, in bytes, of article after the edit under inspection was made
   ART_SIZE_DELT           Difference in article size, in bytes, as a result of the edit
 ART_CHURN_CHARS           Quantity of characters added or removed by edit
  ART_CHURN_BLKS           Quantity of non-adjacent text blocks modified by edit
         ART_REP           Article reputation, capturing vandalism tendencies [10] (EN only)
        TIME_TOD           Time-of-day at which edit was committed (UTC locale)
        TIME_DOW           Day-of-week on which edit was committed (UTC locale)
        COMM_LEN           Length, in characters, of the “revision comment” left with the edit
    COMM_HAS_SEC           Whether the comment indicates the edit was “section-specific”
 COMM_LEN_NO_SEC           Length, in chars., of the comment w/o auto-added section header
   COMM_IND_VAND           Whether the comment is one typical of vandalism removal
     WT_NO_DELAY           WikiTrust [1] score w/o ex post facto evidence (DE, EN only)
   PREV_TIME_AGO           Time, in seconds, since the article was last revised
     PREV_USR_IP           Whether the previous editor of the article was IP/anonymous
   PREV_USR_SAME           Whether the previous article editor is same as current editor
   LANG_CHAR_REP           Size, in chars., of longest single-character repetition added by edit
      LANG_UCASE           Percent of text added which is in upper-case font
      LANG_ALPHA           Percent of text added which is alphabetic (vs. numeric/symbolic)
   LANG_LONG_TOK           Size, in chars., of longest added token (per word boundaries)
     LANG_MARKUP           Measure of the addition/removal of wiki syntax/markup

Table 1. Zero-delay, language-independent features. Some features are not calculated for all lan-
guages. These are not fundamental limitations; rather, the source APIs have yet to extend support
(but trivially could). See Sec. 2.3 for discussion regarding features of the “LANG_*” form.



2.2 Leveraging Ex Post Facto Evidence
More novel is the utilization of ex post facto data in the classification task. To the
best of our knowledge, only the WikiTrust system of Adler et al. [1,2] has previously
described features of this type. Tab. 2 lists the ex post facto signals implemented in our
approach, which includes our own novel contributions (the first 4 features), as well as
those proposed and calculated by Adler et al. (the remainder).
         EX POST FEAT.         DESCRIPTION
        USR_BLK_EVER           Whether the editor has ever been blocked on the wiki
      USR_PG_SZ_DELT           Size change of “user talk” page between edit time and +1 hour
       ART_DIVERSITY           Percentage of recent revisions (±10 edits) made by editor
         HASH_REVERT           Whether article content hash-codes indicate edit was reverted
           WIKITRUST           WikiTrust [1] score with ex-post-facto evidence (DE, EN only)
       WT_DELAY_DELT           Difference in WIKITRUST and WT_NO_DELAY (DE, EN only)
     NEXT_TIME_AHEAD           Time, in seconds, until article was next revised
         NEXT_USR_IP           Whether the next editor of the article is an IP/anonymous editor
       NEXT_USR_SAME           Whether the next article editor is same as current editor
      NEXT_COMM_VAND           Whether the next “comment” indicates vandalism removal

     Table 2. Ex-post-facto features: Leveraging evidence after edit save, but before evaluation.



     No doubt, the strongest of these features is the WikiTrust score (WIKITRUST).
This captures the notion of reputation-weighted content-persistence: text that survives
is trustworthy, especially when the subsequent editors have good reputations. The Wik-
iTrust values we obtain are from a lower-order classifier, encompassing ≈70 data points.
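As a toy illustration only (this is not the WikiTrust algorithm of [1], which tracks word-level histories and editor reputations), the underlying notion of reputation-weighted content persistence might be sketched as:

    def persistence_score(added_words, later_revisions):
        """Toy reputation-weighted content persistence. Each later revision
        'votes' on the edit: the fraction of added words still present,
        weighted by the revising editor's reputation (in [0, 1])."""
        if not added_words or not later_revisions:
            return 0.0
        score = 0.0
        for reputation, surviving_words in later_revisions:
            frac = len(added_words & surviving_words) / len(added_words)
            score += reputation * (2 * frac - 1)  # survival: +rep; removal: -rep
        return score / len(later_revisions)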
     However, it may be possible to improve upon or supplement the WikiTrust score.
First, WikiTrust is computationally intensive, having to track word-level histories. Sec-
ond, content is sometimes removed or re-authored for reasons other than malicious
intent. Third, WikiTrust is not presently enabled for all languages. This motivated our
creation of feature HASH_REVERT, a more efficient and coarse-grained measure. The
hash-code is computed for the article version prior-to, and immediately-after, the edit
under inspection (scope is expanded if the editor makes multiple consecutive edits). If
the hashes match, it indicates an identity revert: the wholesale removal of the editor’s
contributions, which is highly indicative of vandalism.
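A minimal sketch of this check (our illustration; the revision-history representation is hypothetical):

    import hashlib

    def is_identity_revert(versions, i):
        """HASH_REVERT sketch: versions is a chronological list of
        (editor, wikitext) pairs; i indexes the edit under inspection."""
        editor = versions[i][0]
        start = i
        while start > 0 and versions[start - 1][0] == editor:
            start -= 1  # expand scope over the editor's consecutive edits
        if start == 0 or i + 1 >= len(versions):
            return False
        digest = lambda text: hashlib.sha1(text.encode("utf-8")).hexdigest()
        # Compare the version prior to the editor's run against the
        # version immediately following it; a match implies a revert
        return digest(versions[start - 1][1]) == digest(versions[i + 1][1])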
     Another novel feature, USR_PG_SZ_DELT, captures the fact that poor contributors are of-
ten notified/warned of their transgressions on their “talk page”. Informal analysis sug-
gested that German and Spanish versions lack the standardized warning system that
English employs [3]. Thus, a generic “size change” feature was implemented to detect
such talk page contributions.
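A sketch of how such a size change could be retrieved via the standard MediaWiki API (the endpoint and parameter choices are our assumptions; error handling omitted):

    import requests

    API = "https://de.wikipedia.org/w/api.php"  # per-language endpoint

    def talk_page_size(user, iso_timestamp):
        """Size, in bytes, of a user's talk page as of the most recent
        revision at or before iso_timestamp (None if no such revision)."""
        params = {"action": "query", "format": "json",
                  "titles": "User talk:" + user,  # canonical namespace name
                  "prop": "revisions", "rvprop": "size|timestamp",
                  "rvlimit": 1, "rvdir": "older", "rvstart": iso_timestamp}
        pages = requests.get(API, params=params).json()["query"]["pages"]
        revs = next(iter(pages.values())).get("revisions")
        return revs[0]["size"] if revs else None

    # USR_PG_SZ_DELT = talk_page_size(u, t_plus_1h) - talk_page_size(u, t_edit)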

2.3 On Language-Driven Features
When talking about language features, realize that it is possible to produce language-
driven features that are not language-specific (i.e., generic properties). Examples in-
clude our features of the form LANG_*, as found at the bottom of Tab. 1. These mea-
sures are certainly applicable to the languages used herein (German, English, Spanish)
and analogues likely exist in many languages. However, these properties are unlikely
to be universal in nature. In particular, different character sets (e.g., Hindi, Chinese,
Japanese) might prove problematic, but this is ultimately outside the authors’ range of
expertise. It should be noted that languages similar to those under evaluation (i.e., use of
Latin characters, letter casing, space-delimited words, and Arabic numerals) represent
a significant portion of Wikipedia’s article space³.
³ http://meta.wikimedia.org/wiki/List_of_Wikipedias_by_language_group
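For concreteness, a minimal sketch of the generic LANG_* measures of Tab. 1 (the exact tokenization and denominators are our assumptions):

    import itertools, re

    def lang_features(added_text):
        """Sketch of the generic LANG_* measures over text added by an edit."""
        letters = [c for c in added_text if c.isalpha()]
        alnum = [c for c in added_text if c.isalnum()]
        tokens = [t for t in re.split(r"\W+", added_text) if t]
        return {
            # longest single-character run, e.g. "zzzzzz" -> 6
            "LANG_CHAR_REP": max((len(list(g)) for _, g in
                                  itertools.groupby(added_text)), default=0),
            # percent of letters in upper case
            "LANG_UCASE": (sum(c.isupper() for c in letters) / len(letters)
                           if letters else 0.0),
            # percent of alphanumerics that are alphabetic (vs. numeric)
            "LANG_ALPHA": len(letters) / len(alnum) if alnum else 0.0,
            # longest token per word boundaries
            "LANG_LONG_TOK": max((len(t) for t in tokens), default=0),
        }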
            LANG-SPEC. FEAT.        DESCRIPTION
          {DE,EN,ES}_OFFEND         Quantity of offensive terms added/removed by edit
            *_OFFEND_IMPACT         Normalizing *_OFFEND by ART_SIZE_DELT
         {DE,EN,ES}_PRONOUN         Quantity of 1st-person pronouns added/removed
           *_PRONOUN_IMPACT         Normalizing *_PRONOUN by ART_SIZE_DELT

Table 3. Features requiring natural-language customization. Each feature is implemented inde-
pendently, per-language. Spanish and German edits are also processed by the English versions.



    While generic language features are portable, they lack the intuition of language-
specific ones. After all, profanity and slang have little place in encyclopedic content.
Not only are such measures intuitive, they are effective, as the 2010 PAN-CLEF win-
ning approach of Velasco [8] used multiple dictionaries (profanity, sexual terms, bi-
ased words, etc.). This is disheartening as such features: (1) lack portability, (2) can be
evaded with obfuscation, (3) require time-consuming implementation by fluent speak-
ers, and (4) tend to be computationally expensive. Velasco, however, did not include
many of the language-independent features we present in Tab. 1. Thus, as [2] sug-
gested, language-independent features might overlap and render language-specific ones
less critical. We extend that analysis here and do so across multiple natural languages.
    Unfortunately, Velasco’s dictionaries are not open source and the German and Span-
ish equivalents must be implemented. Not NLP experts ourselves, we intend only to
create proof-of-concept and non-exhaustive language-specific features, as per Tab. 3.
This also allows us to perform cost-benefit analysis (i.e., the coverage of dictionaries
vs. the performance improvement) and motivates our decision to encode three different
approaches to compiling the offensive word lists (“offensive” here is just the combina-
tion of all undesirable text categories):

 – SPANISH (ES): We re-purposed a scoring list designed for Spanish Wikipedia use⁴.
   The list contains 800+ manually constructed regexps of extensive complexity (cap-
   turing intra-word permutations of diacritics, case, repeated letters, etc.). Manual
   inspection removed regexps not specific to offensive terminology.
 – ENGLISH (EN): A generic list of 1300+ offensive words (not regexps) is utilized⁵.
   The list is not Wikipedia-specific, but does enumerate conjugated verb forms.
 – GERMAN (DE): Unable to locate a dictionary of sufficient breadth, we decided to
   examine the feasibility of a programmatic approach. We took the union of infor-
   mal profanity lists and ran a stemming algorithm to produce roots which could be
   searched for as embedded (i.e., non word-boundary delimited) regexp matches.

   The text added and removed by an edit is scanned for word/regexp matches. The
number of matches is quantified (+1 for additions, -1 for removals), and these counts form the
{DE,EN,ES}_OFFEND features. The first-person “pronoun” features are straightfor-
ward and intend to capture bias in authorship and possible non-neutral points-of-view.
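A sketch of this scoring (the regexps shown are hypothetical stand-ins for the dictionaries above; the impact normalization follows our reading of Tab. 3):

    import re

    # Hypothetical stand-ins for the per-language offensive-term dictionaries
    OFFEND_REGEXPS = [re.compile(p, re.IGNORECASE)
                      for p in [r"\bstupid\w*\b", r"\bidiot\w*\b"]]

    def offend_score(added_text, removed_text):
        """*_OFFEND: net matches, +1 per addition and -1 per removal."""
        count = lambda text: sum(len(rx.findall(text)) for rx in OFFEND_REGEXPS)
        return count(added_text) - count(removed_text)

    def offend_impact(score, art_size_delta):
        """*_OFFEND_IMPACT: the raw count normalized by ART_SIZE_DELT."""
        return score / abs(art_size_delta) if art_size_delta else 0.0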
⁴ http://es.wikipedia.org/wiki/Usuario:AVBOT/Lista_del_bien_y_del_mal
⁵ http://www.cs.cmu.edu/~biglou/resources/
  ENGLISH FEATURE          RANK        FEATURE (cont.)        RANK       FEATURE (cont.)       RANK
      WIKITRUST (F)           1      ART_SIZE_DELT            21       USR_LAST_RB           41
  WT_DELAY_DELT (F)           2        USR_PG_SIZE            22      COMM_HAS_SEC           42
      WT_NO_DELAY             3            ART_REP            23   ART_CHURN_CHARS           43
    HASH_REVERT (F)           4       USR_PG_WARNS            24     COMM_IND_VAND           44
 NEXT_COMM_VAND (F)           5        LANG_MARKUP            25    ART_CHURN_BLKS           45
  USR_EDITS_MONTH             6      LANG_LONG_TOK            26    ART_EDITS_WEEK           46
   USR_EDITS_WEEK             7         LANG_UCASE            27          ART_SIZE           47
    USR_EDITS_EVER            8   EN_PRONOUN_IMPACT           28     ART_EDITS_DAY           48
  USR_COUNTRY_REP             9    ART_EDITS_TOTAL            29          TIME_DOW           49
  USR_EDITS_DENSE            10            USR_REP            30    ART_EDITS_HOUR           50
        USR_IS_IP            11            ART_AGE            31   NEXT_USR_SAME (F)         51
    USR_EDITS_DAY            12         LANG_ALPHA            32        USR_HAS_RB           52
 USR_PG_SZ_DELT (F)          13        LANG_MARKUP            33       PREV_USR_IP           53
NEXT_TIME_AHEAD (F)          14         EN_PRONOUN            34    USR_BLK_EVER (F)         54
          USR_AGE            15    ART_EDITS_DENSE            35    USR_BLK_BEFORE           55
  COMM_LEN_NO_SEC            16    ART_DIVERSITY (F)          36        USR_IS_BOT           56
 EN_OFFEND_IMPACT            17      LANG_CHAR_REP            37     NEXT_USR_IP (F)         57
   USR_EDITS_HOUR            18      PREV_USR_SAME            38          TIME_TOD           58
        EN_OFFEND            19      PREV_TIME_AGO            39
         COMM_LEN            20    ART_EDITS_MONTH            40

Table 4. Kullback-Leibler divergence (i.e., information-gain) ranking for English features. Ex
post facto signals are indicated by “(F)” (but ranking is independent, so a zero-delay list would
have the same relative ordering). Foreign language features are not included for brevity.
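For discrete (or pre-discretized) features, this ranking criterion reduces to a short computation; a minimal sketch:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(feature_values, labels):
        """Information gain of a discrete feature w.r.t. vandalism labels:
        H(labels) - H(labels | feature). Continuous features (e.g.,
        USR_AGE) would first be binned."""
        n = len(labels)
        by_value = {}
        for v, y in zip(feature_values, labels):
            by_value.setdefault(v, []).append(y)
        conditional = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
        return entropy(labels) - conditional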



3 Evaluation
This section describes and evaluates the machine-learning model built atop our feature
set. We begin by describing our choice of classification algorithm (Sec. 3.1). Then,
this model is used to evaluate feature effectiveness over the labeled training set, paying
particular attention to novel subsets (Sec. 3.2). Finally, we summarize performance over
the PAN-CLEF 2011 competition test set (Sec. 3.3).

3.1 Classification Model
The Weka [4] implementation of the alternating decision tree algorithm (ADTree) is
used for scoring/classification. This method was chosen because it: (1) produces human-
readable models, (2) handles missing features (API failures, missing data, etc.), and
(3) supports enumerated features (our strategy has many booleans). ADTrees have one
parameter of interest: the quantity of “boosting iterations” (i.e., tree-depth). German
and Spanish classifiers utilize 18 iterations and English uses 30, quantities arrived at
via cross-validation (the English training corpus [5] is 32× the size of the other two).
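ADTree itself is a Weka (Java) implementation; as a rough analogue, a sketch of selecting the boosting-iteration count by cross-validation, with scikit-learn's boosted stumps as a stand-in:

    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import GridSearchCV

    # AdaBoost over depth-1 stumps (sklearn's default base learner) stands in
    # for ADTree; n_estimators plays the role of the boosting-iteration count
    # (cf. the 18 chosen for DE/ES and 30 for EN)
    search = GridSearchCV(AdaBoostClassifier(),
                          param_grid={"n_estimators": [10, 18, 30, 50]},
                          scoring="average_precision",  # PR-AUC surrogate
                          cv=10)
    # search.fit(X_train, y_train); search.best_params_["n_estimators"]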

3.2 Training Set Evaluation
All results are produced via 10-fold cross-validation over the training corpus [5]. The
labels of the test corpus were withheld for the competition, as discussed in Sec. 3.3.
        #   GERMAN                    ENGLISH                   SPANISH
        1   WT_NO_DELAY               WT_NO_DELAY               USR_EDITS_MONTH
        2   USR_EDITS_EVER            USR_EDITS_MONTH           USR_EDITS_WEEK
(a)     3   USR_IS_IP                 USR_EDITS_WEEK            USR_EDITS_EVER
        4   USR_EDITS_MONTH           USR_EDITS_EVER            USR_IS_IP
        5   USR_EDITS_WEEK            USR_COUNTRY_REP           ES_OFFEND_IMPACT
        1   NEXT_COMM_VAND (F)        WIKITRUST (F)             NEXT_COMM_VAND (F)
        2   WIKITRUST (F)             WT_DELAY_DELT (F)         NEXT_TIME_AHEAD (F)
(b)     3   WT_NO_DELAY               WT_NO_DELAY               HASH_REVERT (F)
        4   HASH_REVERT (F)           HASH_REVERT (F)           USR_PG_SZ_DELT (F)
        5   NEXT_USR_IP (F)           NEXT_COMM_VAND (F)        USR_EDITS_MONTH

      Table 5. Extending Tab. 4 for all language corpora. Portion (a) permits only zero-delay
      features, while portion (b) also includes ex post facto signals, as indicated by “(F)”.



Core Features and Cross-Language Consistency: We begin with the “core” set of
features (Tab. 1). Though these have been described in the past, their cross-language
evaluation is novel. Although space considerations prevent showing the full feature
rankings for all languages, the top entries (Tab. 5a) are remarkably similar to those presented for
English (Tab. 4, ignoring “(F)” entries), especially when binned by the info-gain metric.
That is, a feature tends to be equally effective no matter the language of evaluation.
    It is unsurprising that the zero-delay WikiTrust feature (WT_NO_DELAY) is the top-
performing feature where available (English, German); it is a lower-order classifier that
wraps many data points. Beyond that, user participation statistics and registration status
are also dominant. Generic language features tend to perform moderately (not all edits
add content), with article-driven signals tending towards the bottom of the rankings.
    While the feature ranking is not unexpected, the cross-language consistency has
stronger implications. It is a sociologically interesting observation that misbehavior is
characterized similarly across language and cultural boundaries. More technically, it
suggests the creation of language-independent classifiers might be feasible, eliminating
the need for new corpora to be amassed for each new Wikipedia edition.

Ex Post Facto Inclusion: As Tab. 5b demonstrates, the inclusion of ex post facto
features dramatically modifies the list of “best features,” with 4 of the top 5 being
of this type for all languages. Such signals also positively affect overall performance,
with PR-AUC increases varying between 3.6% (English) and 13.6% (Spanish) (see Tab. 6).
While these improvements are not overwhelming, it should be emphasized that the high-
accuracy of zero-delay approaches decreases the possible margin for improvement.
    These ex post facto features are redundant, however, in that all try to capture the same
notion: “was the edit reverted?” (particularly WIKITRUST, NEXT_COMM_VAND, and
HASH_REVERT). While all are features of exemplary performance, they vary in effi-
ciency and robustness. For example, WikiTrust employs a complex but secure algorithm
that mines reputation from implicit Wikipedia actions. In contrast, NEXT_COMM_VAND
parses explicit summaries for keywords, which, while simple, could easily be gamed.
The degree to which secure features are required is not immediately apparent. Vandals
are typically poorly incentivized [7] and therefore may not evade crude protections.
                      GERMAN                    ENGLISH                       SPANISH
       METRIC      RND   ZD ALL              RND   ZD ALL             RND        ZD ALL
       PR-AUC     0.302    0.878    0.930    0.074   0.773   0.801    0.310    0.868   0.986
      ROC-AUC     0.500    0.958    0.981    0.500   0.963   0.968    0.500    0.946   0.993

Table 6. Area-under-curve (AUC) measurements for feature sets over training data. This is done
for precision-recall (PR) and receiver-operating characteristic (ROC) curves. Feature sets include
a control classifier (random, RND), zero-delay (ZD), and including ex post facto data (ALL).


           LANG         ZD-WO       ZD-W      DIFF%      ALL-WO       ALL-W       DIFF%
        (PR-AUC) DE        0.881     0.878    -0.34%          0.930      0.930    ±0.00%
        (PR-AUC) EN        0.737     0.773    +4.89%          0.776      0.801    +3.22%
        (PR-AUC) ES        0.805     0.868    +7.83%          0.988      0.986     -0.20%

Table 7. Measuring the impact of language-specific features (Tab. 3). Feature sets are evaluated
with (W) and without (WO) the inclusion of language-specific signals. Otherwise, acronyms are
as defined in Tab. 6. PR-AUC is the singular metric used in this comparison.



Cost vs. Benefits of Language-Specific Signals: As Tab. 7 shows, the performance
benefit of language-specific features varies dramatically. They prove most helpful when
targeting zero-delay detection, and the extensiveness and expertise involved in creating
the “offensive word list” correlate with performance gains. Recall from Sec. 2.3 that
our German approach was quite crude (a stemming algorithm over informal profanity
lists). Such attempts did not translate positively, adding only noise to the classifier. At
the other extreme, a third-party, Wikipedia-customized, and complex set of regular-
expressions was able to increase zero-delay PR-AUC by nearly 8% in the Spanish case.
     Where infrastructure already exists for these purposes, it can and should be re-
utilized (as we did for English and Spanish). Where it does not, it would seem casual
attempts should be avoided. More broadly, it seems wise to investigate autonomous
(and language-independent) means to produce robust dictionaries (e.g., n-grams).

Cumulative Performance: A broader view of classifier performance is presented
numerically in Tab. 6 and visualized in Fig. 1. One interesting observation is the vary-
ing performance between languages. English, despite having the most enabled features
and 32× more training examples, is classified much more poorly than Spanish and German.
At present, we have two hypotheses for why this is the case. First, English has a tool called
the “Edit Filter” which prevents trivial vandalism from being saved⁶ (and becoming a
corpus member). We are unaware of any German/Spanish equivalent, meaning obvious
vandalism (i.e., “low-hanging fruit”) would be corpus members in those cases. Sec-
ond, vandalism tagging is a subjective process. The labeling of the English corpus was
done via Amazon Mechanical Turk [5] (utilizing random persons), whereas the smaller
German/Spanish versions involved Wikipedia researchers. The latter group is likely to
be more consistent in upholding the standards of the Wikipedia community, and such
agreement is particularly important for features like NEXT_COMM_VAND.
⁶ http://en.wikipedia.org/wiki/Wikipedia:Edit_Filter
[Figure 1. Precision-recall curves over training data for (a) German (de), (b) English (en),
and (c) Spanish (es). Each panel plots precision against recall for the Random, Zero-Delay,
and w/Ex-Post-Facto feature sets.]

            #   GERMAN                                ENGLISH                                SPANISH
            1   WT_NO_DELAY                           EN_OFFEND_IMPACT                       ES_OFFEND_IMPACT
            2   USR_EDITS_MONTH                       USR_PG_WARNS                           USR_IS_IP
(a)         3   ART_CHURN_CHARS                       WT_NO_DELAY                            TIME_TOD
            4   USR_PG_SIZE                           USR_EDITS_MONTH                        LANG_UCASE
            5   ART_SIZE_DELT                         LANG_UCASE                             PREV_USR_IP
            1   NEXT_COMM_VAND (F)                    WIKITRUST (F)                          NEXT_COMM_VAND (F)
            2   USR_IS_IP                             NEXT_COMM_VAND (F)                     USR_EDITS_WEEK
(b)         3   LANG_UCASE                            LANG_MARKUP                            NEXT_TIME_AHEAD (F)
            4   LANG_ALPHA                            USR_COUNTRY_REP                        PREV_TIME_AGO
            5   ART_CHURN_CHARS                       LANG_LONG_TOK                          LANG_LONG_TOK

      Table 8. Top feature subsets of size n = 5, calculated using greedy step-wise analysis.
      Portion (a) permits only zero-delay features; (b) includes ex post facto ones.



    Regardless, English-language performance (the only known baseline) is comparable
to the state-of-the-art. That benchmark was set in our prior work [2], which this writing
re-implements with slight modifications. It should be emphasized that it was not our
intention to best that prior work; rather, we sought to use the expanded PAN-CLEF
2011 rules/corpora to analyze novel portions of the problem space.
    Finally, it is interesting to produce the most effective feature subsets for each lan-
guage (Tab. 8). Unlike Tab. 5, this list considers feature correlation and overlap, dis-
playing the features weighted most heavily in the actual ADTree models. These or-
derings are quite distinct from those in Tabs. 4 & 5, and greater analysis is needed to
determine what correlations give rise to these rule chains. For instance, English feature
LANG_MARKUP ranked 25th in info-gain, yet was the 3rd highest ranking in subset
form. Results like these imply a large degree of overlap between features, suggesting
that small (and therefore, efficient) feature sets/trees can produce accurate results.
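One common form of such greedy step-wise selection, sketched under the assumption that subsets are scored by cross-validated PR-AUC (the exact procedure behind Tab. 8 may differ in detail):

    from sklearn.base import clone
    from sklearn.model_selection import cross_val_score

    def greedy_subset(clf, X, y, feature_names, n=5):
        """Forward selection over a feature matrix X (columns ordered as
        feature_names): repeatedly add the feature whose inclusion most
        improves mean cross-validated average precision."""
        chosen = []
        for _ in range(n):
            def score_with(f):
                cols = [feature_names.index(g) for g in chosen + [f]]
                return cross_val_score(clone(clf), X[:, cols], y, cv=10,
                                       scoring="average_precision").mean()
            best = max((f for f in feature_names if f not in chosen),
                       key=score_with)
            chosen.append(best)
        return chosen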

3.3 Test Set Performance
When applied to the label-withheld test set, our model won the 2011 PAN-CLEF com-
petition. The PR-AUCs (DE=0.706, EN=0.822, ES=0.489) show a slight perfor-
mance increase for English, but a dramatic drop for German/Spanish relative to cross-
validation over training data (Tab. 6). When the test corpus labels are revealed, they
should be inspected to see if some type of systematic bias gave rise to this discrepancy.
4 Conclusions
Our novel research directions in this paper were motivated by changes in the 2011
PAN-CLEF competition with respect to both the 2010 edition and the bulk of exist-
ing Wikipedia vandalism research. First, the competition permitted features to leverage
evidence after the edits were made. We identified multiple metrics of this type, which
were extremely effective, and whose implementation made clear the trade-off between
feature efficiency and robustness.
     Second, the competition spanned three natural languages. For language-independent
features (i.e., metadata), this was the first non-English evaluation of such signals, and their
relative ordering was found to be surprisingly consistent across languages. Multiple lan-
guages, however, imply costly localization for language-specific features (e.g., profanity
lists), forcing examination of their effectiveness. Including these atop an extensive set
of language-independent features, we find that minor-to-moderate contributions are still
possible, and the degree of improvement correlates with the localization’s complexity.
     We hope that this work continues to promote and improve the autonomous detection
of vandalism. Such progress frees editors of monitoring roles and allows them to better
contribute to a growing body of collaborative knowledge.

Acknowledgements: This research was supported in part by ONR MURI N00014-07-1-0907.
The authors recognize those colleagues whose techniques were a component in the described
approach: B. Thomas Adler, Luca de Alfaro, Santiago Mola-Velasco, Sampath Kannan, Ian Pye,
and Paolo Rosso (see [2]). Andreas Haeberlen is thanked for his German language assistance.
Martin Potthast is acknowledged for his continued dedication to the vandalism detection task.


References
 1. Adler, B.T., de Alfaro, L.: A content-driven reputation system for the Wikipedia. In:
    WWW’07: Proc. of the 16th International World Wide Web Conference (May 2007)
 2. Adler, B., de Alfaro, L., Mola-Velasco, S.M., Rosso, P., West, A.G.: Wikipedia vandalism
    detection: Combining natural language, metadata, and reputation features. In: CICLing’11
    (Comp. Linguistics and Intelligent Text Processing) and LNCS 6609 (February 2011)
 3. Geiger, R.S., Ribes, D.: The work of sustaining order in Wikipedia: The banning of a
    vandal. In: CSCW’10: Proc. of the Conf. on Computer Supported Cooperative Work (2010)
 4. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA
    data mining software: An update. SIGKDD Explorations 11(1) (2009)
 5. Potthast, M.: Crowdsourcing a Wikipedia vandalism corpus. In: SIGIR’10: Proc. of the
    33rd International ACM SIGIR Conference. pp. 789–790 (2010)
 6. Potthast, M., Stein, B., Holfeld, T.: Overview of the 1st International competition on
    Wikipedia vandalism detection. In: PAN-CLEF 2010 Labs and Workshops (2010)
 7. Priedhorsky, R., Chen, J., Lam, S.K., Panciera, K., Terveen, L., Riedl, J.: Creating,
    destroying, and restoring value in Wikipedia. In: ACM GROUP’07 (2007)
 8. Velasco, S.M.M.: Wikipedia vandalism detection through machine learning: Feature review
    and new proposals. Tech. rep., Lab Report for PAN at CLEF 2010 (2010)
 9. West, A.G., Chang, J., Venkatasubramanian, K., Lee, I.: Trust in collaborative web
    applications. Future Generation Comp. Sys. section on Trusting Software Behavior (2011)
10. West, A.G., Kannan, S., Lee, I.: Detecting Wikipedia vandalism via spatio-temporal
    analysis of revision metadata. In: EUROSEC’10: European Wkshp. on Sys. Security (2010)