Overview of the 1st International Competition on Quality Flaw Prediction in Wikipedia

Maik Anderka and Benno Stein
Web Technology & Information Systems
Bauhaus-Universität Weimar, Germany
pan@webis.de        http://pan.webis.de

Abstract. This paper gives an overview of the task "Quality Flaw Prediction in Wikipedia" of the PAN'12 competition. An evaluation corpus is introduced that comprises 1 592 226 English Wikipedia articles, of which 208 228 have been tagged to contain one of ten important quality flaws. Moreover, the performance of three quality flaw classifiers is evaluated.

1 Introduction

The online encyclopedia Wikipedia is one of the largest and most popular user-generated knowledge sources on the Web. Some facts: Wikipedia contains articles in more than 280 languages, the English Wikipedia alone contains about 4 million articles, the Wikipedia community involves more than 35 million registered editors, and wikipedia.org ranks among the ten most visited Web sites.¹ Probably the biggest challenge for Wikipedia pertains to the quality of its articles, since the community of Wikipedia authors is heterogeneous and since contributions to Wikipedia are not reviewed by experts before their publication. Both the size and the dynamic nature of Wikipedia render a comprehensive manual quality assurance infeasible.

A variety of approaches to automatically assess quality in Wikipedia has been proposed in the relevant literature, see e.g. [13, 7, 6, 11, 15]. However, the practical support for Wikipedia's quality assurance process is marginal, as these approaches provide no rationale governing the respects in which an article violates Wikipedia's quality standards. There are only a few prior studies that target the identification of specific quality flaws, and these studies either investigate only small samples of articles [14] or analyze only a restricted set of flaws [1, 10]. Anderka et al. [3, 4] are the first to provide a comprehensive breakdown of quality flaws in Wikipedia. Their analysis reveals, among other things, that 27.52% of the English Wikipedia articles contain at least one quality flaw, and that 70% of the flaws concern article verifiability. The analysis is based on human-tagged articles, so the actual number of flaws is expected to be even higher: it is more than likely that many flawed articles have not yet been identified.

These facts make clear that the automated prediction of quality flaws in Wikipedia is a relevant problem; research on and development of respective prediction approaches are the main goals of this PAN'12 task.

¹ Wikimedia, http://meta.wikimedia.org/wiki/List_of_Wikipedias. Alexa Internet, Inc., http://www.alexa.com/siteinfo/wikipedia.org.

1.1 Quality Flaw Prediction

We cast quality flaw prediction in Wikipedia as a one-class classification problem, as proposed in [2] and [5]: given a set of Wikipedia articles that are tagged with a particular quality flaw, decide whether an untagged article suffers from this flaw.

Stated formally, let $D$ be the set of Wikipedia articles and let $F$ be a set of quality flaws. We model the classification $c_f(\mathbf{d})$ of an article $d \in D$ with respect to a quality flaw $f \in F$ as the following one-class classification problem: decide whether or not $d$ contains $f$, where a sample of articles containing $f$ is given. Here, $c_f : \mathbf{D} \rightarrow \{1, 0\}$ is a specific classifier for flaw $f$, $\mathbf{d}$ denotes the (vector) representation or document model of article $d$, and $\mathbf{D}$ denotes the set of document models for the Wikipedia articles $D$.
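To illustrate the setting, the following is a minimal sketch of a one-class classifier $c_f$ trained solely on articles tagged with a flaw $f$. It assumes scikit-learn, and the two toy features (word count and reference count) are simplistic stand-ins for the much richer document models used by the actual participants; it is not one of the competition systems.

```python
# Minimal sketch of one-class quality flaw prediction (illustrative only).
# The features below are toy stand-ins for a real document model.
import re
from sklearn.svm import OneClassSVM

def document_model(wikitext):
    """Map an article's wikitext to a small feature vector (toy features)."""
    n_words = len(wikitext.split())
    n_refs = len(re.findall(r"<ref", wikitext))  # inline citation count
    return [n_words, n_refs]

# Articles tagged with flaw f are the only training sample ("positive" class).
tagged = [
    "A short article without any references.",
    "Another article that was tagged with the same cleanup tag.",
]
X_train = [document_model(a) for a in tagged]

# c_f: a classifier for flaw f, trained without any negative examples.
c_f = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_train)

# For an untagged article d: predict() yields 1 (inlier, i.e., d is judged to
# contain f) or -1 (outlier, i.e., d is judged not to contain f).
d = document_model("An untagged article to be checked <ref>...</ref>.")
print(c_f.predict([d]))
```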
A key challenge of this problem is the absence of representative "negative" training data, i.e., articles that are tagged to not contain a particular flaw. This renders common discrimination-based classification techniques such as binary or multiclass classification inapplicable. Feature engineering, i.e., the development of document models that discriminate articles containing a certain flaw from all other articles, is hence one of the primary challenges.

1.2 Evaluating Quality Flaw Classifiers

The acquisition of sensible test data to evaluate a classifier $c_f$ is intricate in the Wikipedia setting; see [5] for an in-depth discussion. The major problem is that no articles are available that have been tagged to not contain a quality flaw $f \in F$. Thus $c_f$ can be evaluated only with respect to its recall. For most relevant use cases, however, precision is the indicated measure of effectiveness; consider for instance a bot that autonomously tags flawed articles in Wikipedia. In order to evaluate a classifier $c_f$ with respect to its precision, one needs a representative sample of articles from outside the target class of $f$, so-called outliers.

The authors of [5] propose two strategies to derive examples from outside the target class: (1) the use of featured articles, which is based on the hypothesis that featured articles do not contain any quality flaw at all (optimistic setting), and (2) the use of random articles that have not been tagged with $f$ (pessimistic setting). Here, we employ a combined strategy and evaluate the quality flaw classifiers using both featured articles and random articles as outlier examples.

2 Evaluation Corpus

Wikipedia users who encounter a flaw may tag the affected article with a so-called cleanup tag.² The available cleanup tags correspond to the set of quality flaws that have been identified so far by Wikipedia users, and the tagged articles provide a source of human-labeled data; this idea has been proposed in [1]. The task here targets the prediction of the ten quality flaws listed in Table 1. The rationale for the selection of this flaw subset is twofold: (1) these flaws are considered to be the most important flaws [5], and (2) these flaws have been used in previous work [2, 5], which makes the results of this task comparable.

² An overview of cleanup tags in the English Wikipedia: http://en.wikipedia.org/wiki/Wikipedia:Template_messages/Cleanup.

Table 1. The ten most important article flaws in the English Wikipedia along with a description.

Flaw name           Description
Unreferenced        The article does not cite any references or sources.
Orphan              The article has fewer than three incoming links.
Refimprove          The article needs additional citations for verification.
Empty section       The article has at least one section that is empty.
Notability          The article does not meet the general notability guideline.
No footnotes        The article's sources remain unclear because it lacks inline citations.
Primary sources     The article relies on references to primary sources.
Wikify              The article needs to be wikified (internal links and layout).
Advert              The article is written like an advertisement.
Original research   The article contains original research.

The evaluation corpus is based on the English Wikipedia snapshot from January 4, 2012.³ For each of the ten quality flaws, the corpus contains Wikipedia articles that are exclusively tagged with the respective cleanup tag. The corpus also contains untagged articles, which have not been tagged with any cleanup tag.
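As an aside, cleanup tags are ordinary templates embedded in an article's wikitext, so flaw-tagged articles can in principle be collected by scanning a dump for the corresponding template names. The following is a minimal sketch under that assumption; the two template patterns are simplified examples, and the sketch does not reproduce the actual corpus construction (which, e.g., requires an article to be exclusively tagged with one cleanup tag).

```python
# Sketch: detecting cleanup templates in wikitext (illustrative only).
import re

# Cleanup templates corresponding to two of the ten flaws (example patterns).
FLAW_TEMPLATES = {
    "Unreferenced": r"\{\{\s*[Uu]nreferenced\b",
    "Orphan": r"\{\{\s*[Oo]rphan\b",
}

def detect_flaw_tags(wikitext):
    """Return the names of all flaws whose cleanup template occurs in the text."""
    return {flaw for flaw, pattern in FLAW_TEMPLATES.items()
            if re.search(pattern, wikitext)}

article = "{{Unreferenced|date=January 2012}}\n'''Example''' is a topic ..."
print(detect_flaw_tags(article))  # {'Unreferenced'}
```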
Altogether, 1 592 226 articles are provided, of which 208 228 are tagged and 1 383 998 are untagged.⁴ For the PAN competition, the corpus is divided into a training corpus and a test corpus.⁵ The training corpus contains tagged articles for each of the ten quality flaws plus an additional 50 000 untagged articles; in the training corpus the respective labels are given. In particular, tagged articles may be considered as "positive" training examples, while untagged articles may be considered as outlier examples to evaluate and tune the classifiers. In the case of a semi-supervised learning approach, the untagged articles serve as additional training examples. The test corpus contains a balanced number of tagged articles and untagged articles for each of the ten quality flaws; in the test corpus the labels are omitted. Moreover, it is ensured that 10% of the untagged articles are featured articles, in order to address both the optimistic and the pessimistic setting mentioned in Section 1.2.

³ Wikimedia downloads: http://dumps.wikimedia.org/enwiki/20120104.
⁴ The corpus is available at http://www.webis.de/research/corpora.
⁵ For details about the size and composition of the corpora see http://www.webis.de/research/events/pan-12/pan12-web/wikipedia-quality.html.

3 Overview and Evaluation of Flaw Prediction Approaches

This section briefly overviews the submitted quality flaw prediction approaches and reports on their evaluation. Of the 21 registered teams, three submitted runs for this task; see Table 2. Ferretti et al. [8] and Ferschke et al. [9] submitted reports describing their quality flaw classifiers, while Pistol and Iftene provided only a brief description.

Table 2. Participating teams of the 1st International Competition on Quality Flaw Prediction in Wikipedia.

Team name           Participants and affiliations
Ferretti et al.     Edgardo Ferretti*, Donato Hernández Fusilier°, Rafael Guzmán Cabrera°,
                    Manuel Montes-y-Gómez†, Marcelo Errecalde*, and Paolo Rosso‡
                    * Universidad Nacional de San Luis, Argentina
                    ° Universidad de Guanajuato, Mexico
                    † Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Mexico
                    ‡ Universidad Politécnica de Valencia, Spain
Ferschke et al.     Oliver Ferschke, Iryna Gurevych, and Marc Rittberger
                    Technische Universität Darmstadt, Germany
Pistol and Iftene   Ionut Cristian Pistol and Adrian Iftene
                    "Alexandru Ioan Cuza" University of Iasi, Romania

3.1 Features and Classifiers

Ferretti et al. apply PU learning, a semi-supervised learning paradigm proposed by Liu et al. [12]. The algorithm is implemented as a two-step strategy: (1) a set of so-called "reliable negatives" is identified from the set of untagged articles, and (2) the reliable negatives and the tagged articles are used to train a binary classifier. Ferretti et al. employ a Naive Bayes classifier within the first step and a Support Vector Machine within the second step. Their document model is based on 73 features, which form a subset of the features proposed in [5]. The same document model is used for each of the ten flaws.

Ferschke et al. regard the problem as a binary classification task, using the tagged articles as positive instances and the untagged articles as negative instances. They employ two machine learning approaches, namely a Naive Bayes classifier and C4.5 decision trees. Their document model is based on 32 feature types. In particular, a dedicated document model is used for each flaw, which is determined by a feature selection approach.
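To make the two-step strategy of Ferretti et al. concrete, here is a minimal sketch of PU learning in that spirit, assuming scikit-learn and NumPy; the toy feature vectors, the probability threshold, and the classifier settings are illustrative assumptions, not details of the actual submission.

```python
# Sketch of two-step PU learning (Liu et al. [12]): illustrative only.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

def pu_learn(X_pos, X_unlabeled, threshold=0.1):
    # Step 1: train Naive Bayes on positives vs. all unlabeled articles, then
    # keep unlabeled articles whose estimated positive probability is low as
    # "reliable negatives". The threshold is a tunable assumption.
    X = np.vstack([X_pos, X_unlabeled])
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_unlabeled))])
    nb = GaussianNB().fit(X, y)
    p_pos = nb.predict_proba(X_unlabeled)[:, 1]
    X_rn = X_unlabeled[p_pos < threshold]          # reliable negatives

    # Step 2: train a binary SVM on positives vs. reliable negatives.
    X2 = np.vstack([X_pos, X_rn])
    y2 = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_rn))])
    return SVC(kernel="rbf", gamma="scale").fit(X2, y2)

# Toy feature vectors (e.g., word count, reference count) for demonstration.
X_pos = np.array([[120., 0.], [300., 1.], [80., 0.]])     # tagged articles
X_unl = np.array([[500., 12.], [90., 0.], [2000., 40.]])  # untagged articles
c_f = pu_learn(X_pos, X_unl)
# Prints [1.] if the article is predicted to contain flaw f, [0.] otherwise.
print(c_f.predict([[150., 0.]]))
```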
Instead of using machine learning, Pistol and Iftene resort to a rule-based approach. They define a particular set of rules for each flaw and classify an article as flawed if it fulfills the formulated requirements.

3.2 Evaluation

The quality flaw classifiers are evaluated for each of the ten flaws individually. To determine the winning classifier, the prediction performance is judged by averaging precision, recall, and F-measure over all ten quality flaws. Table 3 shows the prediction performance of the quality flaw classifiers.

Table 3. Performance of the quality flaw predictors in terms of precision, recall, and F-measure.

Flaw name          Team name           Precision   Recall      F-measure
Unreferenced       Ferretti et al.     0.744731    0.954000    0.836475
                   Ferschke et al.     0.780229    0.884000    0.828880
                   Pistol and Iftene   0.056462    1.000000    0.106889
Orphan             Ferretti et al.     0.830365    0.979000    0.898577
                   Ferschke et al.     0.862873    0.925000    0.892857
                   Pistol and Iftene   0.016669    0.241000    0.031181
Refimprove         Ferretti et al.     0.734848    0.970000    0.836207
                   Ferschke et al.     0.614566    0.751000    0.675968
                   Pistol and Iftene   0.034962    0.357000    0.063687
Empty section      Ferretti et al.     0.741546    0.921000    0.821588
                   Ferschke et al.     0.876081    0.912000    0.893680
                   Pistol and Iftene   0.056462    1.000000    0.106889
Notability         Ferretti et al.     0.739655    0.858000    0.794444
                   Ferschke et al.     0.661491    0.852000    0.744755
                   Pistol and Iftene   0.055024    0.477000    0.098666
No footnotes       Ferretti et al.     0.720446    0.969000    0.826439
                   Ferschke et al.     0.730364    0.902000    0.807159
                   Pistol and Iftene   0.034518    0.170000    0.057384
Primary sources    Ferretti et al.     0.716615    0.923000    0.806818
                   Ferschke et al.     0.735769    0.866000    0.795590
                   Pistol and Iftene   0.052055    0.423000    0.092702
Wikify             Ferretti et al.     0.742195    0.737000    0.739589
                   Ferschke et al.     0.677912    0.844000    0.751893
                   Pistol and Iftene   0.056462    1.000000    0.106889
Advert             Ferretti et al.     0.736133    0.929000    0.821397
                   Ferschke et al.     0.853306    0.826000    0.839431
                   Pistol and Iftene   0.046575    0.582000    0.086248
Original research  Ferretti et al.     0.647462    0.930966    0.763754
                   Ferschke et al.     0.739544    0.767258    0.753146
                   Pistol and Iftene   0.022903    0.542406    0.043951
Averaged over      Ferretti et al.     0.735400    0.917097    0.814529
all flaws          Ferschke et al.     0.753213    0.852926    0.798336
                   Pistol and Iftene   0.043209    0.579241    0.079449

The classifier of Ferretti et al. performs best in terms of the averaged F-measure and the averaged recall. The classifier of Ferschke et al. achieves a slightly higher averaged precision, but a much lower averaged recall. The third classifier, by Pistol and Iftene, falls far behind because of its very low averaged precision. The situation is nearly the same for the individual flaws: except for the flaw Wikify, Ferretti et al. generally achieve a higher recall than Ferschke et al. For seven of the ten quality flaws, Ferschke et al. achieve the highest precision. In terms of the F-measure, however, the classifier of Ferretti et al. performs best for seven of the ten quality flaws.

4 Conclusion

The results of the 1st International Competition on Quality Flaw Prediction in Wikipedia can be summarized as follows: three quality flaw classifiers have been developed, which employ a total of 105 features to quantify the ten most important quality flaws in the English Wikipedia. Two classifiers achieve promising performance for particular flaws. An important "by-product" of the competition is the first corpus of flawed Wikipedia articles, the PAN Wikipedia quality flaw corpus 2012 (PAN-WQF-12).
Acknowledgement

We thank the German chapter of the Wikimedia Foundation, Wikimedia Deutschland, for sponsoring the prize for the winning team.

Bibliography

[1] M. Anderka, B. Stein, and N. Lipka. Towards automatic quality assurance in Wikipedia. In Proceedings of the 20th international conference on World Wide Web (WWW 2011), pages 5–6, 2011.
[2] M. Anderka, B. Stein, and N. Lipka. Detection of text quality flaws as a one-class classification problem. In Proceedings of the 20th ACM conference on information and knowledge management (CIKM 2011), pages 2313–2316, 2011.
[3] M. Anderka and B. Stein. A breakdown of quality flaws in Wikipedia. In Proceedings of the 2nd joint WICOW/AIRWeb workshop on Web quality (WebQuality 2012), pages 11–18, 2012.
[4] M. Anderka, B. Stein, and M. Busse. On the evolution of quality flaws and the effectiveness of cleanup tags in the English Wikipedia. In Wikipedia Academy 2012 (WPAC 2012), 2012.
[5] M. Anderka, B. Stein, and N. Lipka. Predicting quality flaws in user-generated content: the case of Wikipedia. In Proceedings of the 35th international ACM SIGIR conference on research and development in information retrieval (SIGIR 2012), pages 981–990, 2012.
[6] J. Blumenstock. Size matters: word count as a measure of quality on Wikipedia. In Proceedings of the 17th international conference on World Wide Web (WWW 2008), pages 1095–1096, 2008.
[7] D. Dalip, M. Gonçalves, M. Cristo, and P. Calado. Automatic quality assessment of content created collaboratively by Web communities: a case study of Wikipedia. In Proceedings of the 9th ACM/IEEE-CS joint conference on digital libraries (JCDL 2009), pages 295–304, 2009.
[8] E. Ferretti, D. H. Fusilier, R. G. Cabrera, M. Montes-y-Gómez, M. Errecalde, and P. Rosso. On the use of PU learning for quality flaw prediction in Wikipedia: notebook for PAN at CLEF 2012. In Notebook Papers of CLEF 2012 LABs and Workshops, 2012.
[9] O. Ferschke, I. Gurevych, and M. Rittberger. FlawFinder: a modular system for predicting quality flaws in Wikipedia: notebook for PAN at CLEF 2012. In Notebook Papers of CLEF 2012 LABs and Workshops, 2012.
[10] L. Gaio, M. den Besten, A. Rossi, and J. Dalle. Wikibugs: using template messages in open content collections. In Proceedings of the 5th symposium on wikis and open collaboration (WikiSym 2009), pages 14:1–14:7, 2009.
[11] M. Hu, E. Lim, A. Sun, H. Lauw, and B. Vuong. Measuring article quality in Wikipedia: models and evaluation. In Proceedings of the 16th ACM conference on information and knowledge management (CIKM 2007), pages 243–252, 2007.
[12] B. Liu, Y. Dai, X. Li, W. S. Lee, and P. Yu. Building text classifiers using positive and unlabeled examples. In Proceedings of the 3rd IEEE international conference on data mining (ICDM 2003), pages 179–186, 2003.
[13] N. Lipka and B. Stein. Identifying featured articles in Wikipedia: writing style matters. In Proceedings of the 19th international conference on World Wide Web (WWW 2010), pages 1147–1148, 2010.
[14] B. Stvilia, M. Twidale, L. Smith, and L. Gasser. Information quality work organization in Wikipedia. Journal of the American Society for Information Science and Technology, 59(6):983–1001, 2008.
[15] D. Wilkinson and B. Huberman. Cooperation and quality in Wikipedia. In Proceedings of the 3rd symposium on wikis and open collaboration (WikiSym 2007), pages 157–164, 2007.