<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Novel Balanced Feature Representation for Wikipedia ? Vandalism Detection Task</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>István Hegedu ̋ s</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Szeged and Hungarian Academy of Sciences</institution>
          ,
          <country country="HU">Hungary</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Szeged</institution>
          ,
          <country country="HU">Hungary</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <abstract>
        <p>In online communities, like Wikipedia, where content edition is available for every visitor users who deliberately make incorrect, vandal comments are sure to turn up. In this paper we propose a strong feature set and a method that can handle this problem and automatically decide whether an edit is a vandal contribution or not. We present a new feature set that is a balanced and extended version of the well known Vector Space Model (VSM) and show that this representation outperforms the original VSM and its attribute selected version as well. Moreover, we describe other features that we used in our vandalism detection system and a parameter estimation method for a weighted voting metaclassifier.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In this paper we propose a common feature set for the task, which contains basic,
word based features and complex bag-of-word based, optimized features as well. After,
we introduce a voting based classifier method. We propose a method that helps fine tune
the parameters of this meta-classifier avoiding the overfitting during the model building
phase. Basically, this study is an overview of our system, which was applied in the PAN
2010 Wikipedia Vandalism Detection shared task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Machine learning based vandalism detection approach</title>
      <p>To solve the Wikipedia vandalism detection problem we decided to propose an
inferring method, which is as automatic as possible. This method follows the traditional
NLP approach: first, we extract features from the published training set, then, we apply
statistical learning algorithms to produce a model which can be used for automatically
labeling the evaluation set. The main idea behind our features was to try to capture the
vandalism class as much as possible, since the regular one is too general. The
description of the features is the following:
– BalancedVSM (BVSM)</p>
      <p>This feature is a specialization of the VSM where the vector of a certain document
contains only 0 or 1 values for each dimension. In our case, we used 4 different
values as vector elements:
– when the edit does not contain the word, then the value is n,
– when the word is in an added sequence, then the value is a,
– when the word is in a removed sequence, then the value is d,
– when the word is in a changed sequence, then the vale is c.</p>
      <p>
        As it is well known using this type of VSM representation is not very successful [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
due to the fact that the dimension of this representation is 47 324. And so we had
to apply a dimension reduction method, which is based on the InfoGain [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] score.
Our preliminary observations showed that choosing the top 100 attributes results in
better representation.
      </p>
      <p>Since the distribution of the regular and vandalism samples is totally unbalanced
(∼93.86% regular samples), the above described attribute extraction step
overrepresents the words from the regular edits. Having seen this problem we balanced
the VSM representation: initially, we selected all the samples which were
classified as vandalism. Then, we iteratively added to this set randomly selected, regular
samples of the same quantity. Next, we performed the previously described
dimensional reduction step and we stored and summed up the InfoGain scores of the top
100 attributes. Finally, we selected the top 100 attributes based on the aggregated
scores and used them as Balanced SVM attributes.
– CharacterStatistic</p>
      <p>This feature family involves two different attributes: the upper case letter and the
non-letter character occurrences divided by the number of characters respectively.
– RepeatedCharSequences</p>
      <p>One of the signs of vandalism is when somebody just repeats a sort string e.g.
”asdasdasdasdasd“. For this reason, we scanned the modifications and the comments
to find short and frequent repeats.
– ValidWordRatio</p>
      <p>In these two attributes, we used dictionaries to provide the feature values. We used
a simple English language dictionary and another that contains pejorative English
expressions. Finally, the feature values are the number of the word occurrences in
the dictionaries normalized by the word occurrences in the given edit.
– CommentStatistic (non-edit based feature)</p>
      <p>Commenting on the modification is made available for each user who edits Wikipedia
pages. The possible feature values are:
– deleted if the comment of the edit was deleted,
– comment if the user has written into the comment field,
– nothing in any other cases.
– UserNameOrIP (non-edit based feature)</p>
      <p>The user who edits Wikipedia can either register and choose a nickname or not
register and use his IP number. So we added a feature that describes whether a user
is registered or not.</p>
      <p>
        Based on the previously defined features, we built several models applying different
learning algorithms. These algorithms are quite common, and their implementations
are available from several sources. We used the WEKA [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] implementations in our
experiments. The algorithms used and their short descriptions are the following:
– ZeroR: the most frequent class classifier.
– NaïveBayes : Bayes’ theorem based classifier.
– J48: one of the decision tree learning methods.
– LogReg: the maximum likelihood based Logistic Regression method.
– SMO: an implementation of the SVM classifier.
– WeightedVotingMetaclassifier: This classifier combines several underlying
classifier algorithms based on a weighted aggregation. This algorithm is the only one
which is not an official WEKA algorithm since we had to develop it as a WEKA
extension.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>In this section we describe the evaluations we made and present our final result
measured on the released evaluation set. Since we knew the correct class labels for the
training set only, we could only use this information for evaluation. We used the AUC
score as evaluation metric.
3.1</p>
      <p>Evaluation of different VSMs
In our first evaluation, we investigated the representation power of different VSMs.
The overall results can be seen in Figure 1. Here, we evaluated several classifiers on the
Normal VSM feature set, which is a simple VSM representation of the edits. The second
feature set is an Attribute Selected VSM, where we retained the top 100 InfoGain scored
VSM features. As one can see this is a much stronger feature set, but by using the
Balanced Attribute Selected VSM, we can achieve a higher 10-fold AUC score in the
case of each classifier. For this reason, we chose this representation as the base VSM
representation in later evaluations.
0.4</p>
      <p>Normal VSM</p>
      <p>Attribute Selected VSM</p>
      <p>Balanced Attribute Selected VSM
ZeroR
Our second evaluation focused on the examination of the relation of the defined feature
sets to the chosen training algorithms. We defined some feature sets as the subsets of
the previously defined feature ranges and performed some learning-evaluation phases
applying different training algorithms. The results are summarized in Table 1.</p>
      <p>In Table 1, the semantics of the feature set labels is the following: ”BSVM“ means
the Balanced VSM representation; ”BSVM, stop“ is the same as the previous except
that we added a stop world list, which ignores the meaningless words; ”All features,
stop“ represents the feature set, where we used all the above defined features and in the
case of BVSM the stop world list was also used.</p>
      <p>The most reasonable models are the probability based ones, however in the case of
the last feature set (All features, stop) the J48 algorithm, which is a clearly
discriminative model based approach, also shows pretty good AUC result.</p>
      <p>In the case of the last feature set, the fact that one of the discriminative model could
achieve a similar result than the probability based approaches indicated that this feature
set is quite stable. In our further experiments we used this feature set.
1 0
0.2</p>
      <p>1
0.8
0.40N.6aiveBayes</p>
      <p>From the fact that these algorithms work in a completely different way, we assumed
that perhaps the algorithms, which were based on different approaches, made different
errors. From this naturally comes that idea that we should try to combine the three best
algorithms namely the J48, the NaïveBayes and the LogReg, and built the previously
introduced Weighted Voting Metaclassifier on the top of these three algorithms. The
only questions here are How should we determine the weights of the underlying
classifiers? and Are the optimal weights found in the training set optimal on the evaluation
set as well?
We decided that we use the 10-fold AUC scores as the optimality measurement of a
weight setting. To check the validity of our optimization process, we splitted the training
set into an optimization and an evaluation set by the ratio of 4:1. We performed a
10fold cross validation based optimization of the parameters on the optimization set and
we checked whether this selection is optimal on the evaluation set or not. We performed
this optimization in the whole parameter space. The summary of our optimization can
be seen in Figure 2. In this figure, the x-axis and y-axis represent the weight of the
J48 and Naïve Bayes classifiers respectively. Since the weights must be normalized,
the weight of the LogReg model can be calculated as (1−weight of J48 − weight of
NaïveBayes ).</p>
      <p>As one can see, the results of the evaluation set and the results of the optimization
sets correlate (the two surfaces are almost the same). So we can say that these
optimization criteria are valid and we found that the optimal weighting of the algorithms is the
following (J 48 : 0.3; N aiveBayes : 0.09; Logistic : 0.61).</p>
      <p>The achieved AUC 10-fold cross validation based score of the optimally weighted
metaclassifier is 0.9129, which is significantly higher than the best score in Table 1.
Thus, we used this combined Weighted Voting Metaclassifier model (, which learned
on the full train set and used the weights presented above) for making our final
predictions on the official evaluation set. Our result makes 0.87669 error score on the official
evaluation set of vandalism detection task.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>Our experiments in the field of detecting vandalism in Wikipedia edits indicated that
we should participate in the Wikipedia Vandalism Detection Shared Task. Although our
solution made an average performance on the challenge, we feel that our work has a
strong contribution. This contribution is twofold. First, we developed a strong feature
representation for the task, which can be built in a fully automatic manner and some
of these features are pretty complex e.g. the Balanced VSM representation, which is
a novel extension of the basic VSM representation and is suitable for learning tasks
where the class labels have a highly unbalanced distribution. Second, we successfully
combined classification methods in a weighted manner, where the weights had been
optimized as hyperparameters.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <article-title>Pan 2010 lab, task 2: Wikipedia vandalism detection</article-title>
          , http://www.uniweimar.de/medien/webis/research/workshopseries/pan-10/task2-vandalism-detection.html
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Belani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Vandalism detection in wikipedia: a bag-of-words classifier approach</article-title>
          .
          <source>CoRR abs/1001</source>
          .0700 (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bishop</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          :
          <source>Pattern Recognition and Machine Learning (Information Science and Statistics)</source>
          . Springer-Verlag New York, Inc., Secaucus, NJ, USA (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reutemann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          :
          <article-title>The weka data mining software: an update</article-title>
          .
          <source>SIGKDD Explor. Newsl</source>
          .
          <volume>11</volume>
          (
          <issue>1</issue>
          ),
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Text categorization with support vector machines: learning with many relevant features</article-title>
          . In: Nédellec,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Rouveirol</surname>
          </string-name>
          , C. (eds.)
          <source>Proceedings of ECML-98, 10th European Conference on Machine Learning</source>
          . pp.
          <fpage>137</fpage>
          -
          <lpage>142</lpage>
          . Springer, Heidelberg et al. (
          <year>1998</year>
          ), http://citeseer.ist.psu.edu/joachims97text.html
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Maron</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuhns</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>On relevance, probabilistic indexing and information retrieval</article-title>
          .
          <source>J. ACM</source>
          <volume>7</volume>
          (
          <issue>3</issue>
          ),
          <fpage>216</fpage>
          -
          <lpage>244</lpage>
          (
          <year>1960</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nigam</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>A comparison of event models for naive Bayes text classification</article-title>
          .
          <source>In: Learning for Text Categorization: Papers from the 1998 AAAI Workshop</source>
          . vol.
          <volume>752</volume>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          (
          <year>1998</year>
          ), http://www.kamalnigam.com/papers/multinomial-aaaiws98.pdf
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>C.S.:</given-names>
          </string-name>
          <article-title>A vector space model for automatic indexing</article-title>
          .
          <source>Commun. ACM</source>
          <volume>18</volume>
          (
          <issue>11</issue>
          ),
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          (
          <year>1975</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <article-title>: Machine learning in automated text categorization</article-title>
          .
          <source>ACM Comput. Surv</source>
          .
          <volume>34</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Zhang</surname>
            , T.,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Oles</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Text categorization based on regularized linear classification methods</article-title>
          .
          <source>Inf. Retr</source>
          .
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <fpage>5</fpage>
          -
          <lpage>31</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>