Enhancing MediaWiki Talk pages with Semantics for Better Coordination*
A Proposal

Jodi Schneider, Alexandre Passant, John G. Breslin**

Digital Enterprise Research Institute, National University of Ireland, Galway
firstname.lastname@deri.org

* The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Líon-2).
** John G. Breslin is also a member of the School of Engineering and Informatics, NUI Galway.

Abstract. This paper presents a 15-item classification for MediaWiki Talk page comments, associated with a new lightweight ontology that extends SIOC to represent these categories. We discuss how this ontology can enhance MediaWiki Talk pages, with RDFa, making content of such pages easier to parse and to understand.

Key words: MediaWiki, Wikipedia, Talk pages, RDFa, SIOC

1 Introduction

Wikis are often used for collaborative knowledge gathering and sharing, and coordination of this work may take place on and off the wiki (e.g. [8]). However, finding relevant conversations may become more difficult as their volume increases. MediaWiki software (http://www.mediawiki.org/), used by Wikipedia, Wikia (http://www.wikia.com/), and other wikis, is one of the most popular wiki systems, and we focus on it throughout the paper. Article-level coordination is common in MediaWiki; by default, MediaWiki installations provide a Talk namespace. Each article links to a Talk page (originally empty), which can be used to coordinate, discuss, and dispute the editing of that article. Figure 1 shows a sample Talk page.

Fig. 1. Talk page for the Semantic Web article in Wikipedia

Talk pages are heavily used (as we discuss in Section 2.1), and some improvements to Talk pages have already been made available as MediaWiki plugins (http://www.mediawiki.org/wiki/Extension:LiquidThreads; http://www.mediawiki.org/wiki/Category:Discussion_and_forum_extensions). We believe that Talk pages could benefit from increased semantics.

As Talk pages grow, MediaWiki editors may benefit from tools to help identify relevant comments. We provide sample RDFa markup for MediaWiki Talk pages, using a lightweight ontology for Talk page comments which extends SIOC [2]. This markup and ontology provide underlying metadata which could later be used to highlight and query for certain types of Talk page comments.

In the remainder of the paper, we first review related work, then describe 15 categories used to classify comments on MediaWiki Talk pages. Next we distill that classification system to a lightweight ontology for relevant Talk page comments, which we use to mark up a Talk page segment in RDFa. Finally we outline work in progress on leveraging this ontology with RDFa markup and JavaScript- and SPARQL-based tools.

2 Related Work

2.1 Talk pages are heavily edited on Wikia and Wikipedia

Based on their studies of Wikia, Kittur & Kraut postulate that article talk scales linearly with the size of the wiki [5]. They compare coordination and Talk pages of Wikipedia and over 6000 Wikia wikis, finding differences which they attribute to differences in community size and type.

Wikipedia’s Talk pages are heavily used, and in recent years, Talk pages have been added more quickly than articles, growing at a rate of 11x, compared to 9x for articles [11]. Over a 2.5-year period, edits to Wikipedia Talk pages nearly doubled, from 11% to 19% of all page edits, while article edits nearly halved, from 53% to 28% of all page edits [10]. Further, Wikipedia’s users make a larger or smaller percentage of edits to Talk pages depending on their social roles [12].

2.2 Studies of Wikipedia Talk pages

While Wikipedia Talk pages have been studied from a content analysis, communications theory, and data mining perspective, further research is needed because the variance between Talk pages is significant. For instance, the most common type of discussion, coordination requests (described in Section 3 below), ranges widely, from 2% to 97% of the comments on a page, depending on the page [11]. Due to the variance, perhaps it is not surprising that researchers do not agree on the second most common type of discussion [3][11]. However, despite the evident variance, few categorical differences between Talk pages have been identified or systematically described. Furthermore, sample sizes for qualitative studies have been small (see [10] for a comparison of Featured and non-Featured articles with the largest sample size, 60 Talk pages). Other studies of Talk pages include [6], [4], [1], and [3].

Viégas [11] provides both a manual classification of 25 hand-selected Talk pages, and a quantitative analysis, which reveals that articles with Talk pages are more highly edited, and have more editors than articles without Talk pages. In particular, “94% of the pages with more than 100 edits have related Talk pages”. The dimensions used in their manual classification are further discussed in Section 3, where they form the basis for our lightweight ontology.

3 Classifying comments in Wikipedia

Our classification began organically from the items in Talk pages we reviewed for our content analysis [9]. These coalesced into a set of classifications, which we then compared with the classification frameworks used in [11] and [10]. Since we planned to develop an ontology for editors to apply to their own comments, the directness of Viégas’ classifications suited us, especially since these had already been used for at least two studies, and were very similar to our own classification.
By contrast, since Stvilia classifies the possible information quality problems of an article, his classifications (such as cohesiveness and verifiability) require more abstraction, since they describe attributes of the article, not of the comment; further, some terms (such as semantic consistency and security) might not be instantly accessible to the lay reader and wiki editor.

To update and extend Viégas’ analysis [11], we undertook a manual content analysis [9] of Talk page comments, based on 100 Talk pages from five different types of Wikipedia Talk pages. Our content analysis used 15 non-mutually-exclusive classifications. First, we used the 11 classifications defined by Viégas [11]; Table 1 shows definitions of each term, with examples taken from Wikipedia Talk pages that we analyzed. To capture other features we were interested in, we added 4 new, non-mutually-exclusive classifications as shown in Table 2.

Table 1. Viégas’ 11 types of Talk page comments [11]

  Classification: Requests/suggestions for editing coordination
  Definition: Ideas, comments, or suggestions involving editing the article.
  Example: Currently some of the refs are YYYY-MM-DD format and some are Month DD, YYYY. Which format do we want to standardize to?

  Classification: Requests for information
  Definition: Questions asked by someone who doesn’t intend to edit the page.
  Example: Where is Ligurian spoken in the Var ?

  Classification: References to vandalism
  Definition: Mentions of vandalism.
  Example: I’ve semi-protected the article for another week, the signal-to-noise ratio of the IP edits seemed too low.

  Classification: References to wiki guidelines and policies
  Definition: References to guidelines and/or policies of this wiki.
  Example: The section I removed had no sources / references - if you have sources they’re no good being kept a secret ;) WP:VERIFY, WP:CITE. Thanks/

  Classification: References to internal wiki resources
  Definition: References to internal wiki resources such as diffs, Talk page discussions, old version of a page.
  Example: Would it be a good thing to re-add the links that were taken off in August? Somebody made them into a template that was subsequently deleted. The edit to recover the old links is here: [6]

  Classification: Off-topic remarks
  Definition: Remarks not relating to editing the article.
  Example: PLATO IS THE BEST MAN ALIVE! LONG LIVE PLATO

  Classification: Polls
  Definition: Formal proposals followed by statements such as Support and Oppose, with justifications.
  Example: A month should be deleted from the “Deaths in [CURRENT YEAR]” page ONE WEEK after the month ends...

  Classification: Requests for peer review
  Definition: Requests for peer review.
  Example: Users hoping to elevate articles to featured status may solicit a peer review. [11]

  Classification: Information boxes
  Definition: Special boxes with information, usually found at the top of a Talk page.
  Example: See Fig. 2(a), which proposes and discusses a new info box for the Swine influenza article.

  Classification: Images
  Definition: Images posted on the Talk page.
  Example: See Fig. 2(b)

  Classification: Other
  Definition: The sole exclusive category, describes items that don’t fit elsewhere.
  Example: “This review is transcluded from Talk:Wiki/GA1. The edit link for this section can be used to add comments to the review.”

Table 2. Our 4 additional comment types for Talk pages

  Classification: References to sources outside the wiki
  Definition: References to sources, including print and deep web resources, outside this wiki.
  Example: Exclusive! Mighty Stef records football protest song” Hot Press. Not sure where to put it but I’ll leave it here as somebody might find it useful...

  Classification: References to reverts, removed material, or controversial edits
  Definition: Discussions of reverts, removing material, or controversial edits.
  Example: I noticed some people edit the page into what it will be in 10 minutes but someone is reverting it...just let it be.

  Classification: Reference to edits the discussant made
  Definition: Applied when an editor discusses his/her own article edits on the Talk page.
  Example: Added the About.com review since the review was part of the reception section.

  Classification: Requests for help with another article, portal, etc.
  Definition: Solicitations for assistance elsewhere, or recruiting editorial help in the Talk page for another article.
  Example: This is just to invite attention to the page Facebook statistics just created; of all interested editors. I have just placed a mergeto tag in it. Thanks.

Fig. 2. Comments from the Swine influenza Talk page containing: (a) a proposed infobox and, (b) images.

We added these four types because:

– Sources are heavily discussed in Talk pages, and some comments seem to be made solely to deposit a source. While many sources are on the open web (and can be detected as external links), print resources, inexact references, and deep web resources may also be provided.
– Disagreements about article content often take place in the context of reverts to the page. Discussions about removing content or editing controversial material may also take place on the Talk page before the article is edited.
– The Talk page may be used to notify other editors about a recent edit, perhaps to provide further description, anticipate questions, or clarify that a suggestion has been implemented. Editors may also explain their own edits in discussions of reverts and edit wars.
– The Talk page is often seen as a site for communication with editors who have interest in or knowledge about a given topic. Requests for help, like Requests for information, draw on that perceived expertise.

4 A model for structuring wiki contributions

Based on the aforementioned 15 categories (11 from previous work plus the 4 that we introduced), we designed a lightweight vocabulary for annotating Talk pages.
The main purpose of this model is to categorize each comment in the wiki page, so that, for example, one could immediately identify all the references to vandalism, all the pages requiring help, or all the sources recommended on the Talk page. This could be useful since editors may specialize, performing a certain type of task repeatedly [12]. Categorization could also facilitate automatically collating comments, for instance transcluding Requests for Information into a more appropriate spot, such as the Wikipedia Reference Desk (http://en.wikipedia.org/wiki/Wikipedia:Reference_desk) for that category.

To that end, we provide a model (applied to a Talk page in Fig. 3):

– using existing ontologies, namely FOAF and SIOC, to model the users, the discussion topics (considered as SIOC threads), and the comments. Among others, we reuse the sioct:WikiArticle class from the SIOC Types module and the sioc:has_discussion property that was introduced in our previous work on modeling wiki structure using semantics [7].
– providing new classes to represent some of the classifications introduced in Section 3. We focused only on the request and reference categories, for two reasons. First, these are the ones that people might indicate when they add new content (we describe the process in Section 5.2). It is hard to imagine that someone would mark their own comment as off-topic; however, labeling it a “request for help” seems plausible. Second, these categories seem to be the most relevant for querying and retrieving information.

In addition, further RDF properties could be used, e.g. from the Dublin Core vocabulary. For instance, when making a ReferenceToEdit, a permalink to the edit could be specified with dcterms:requires; when making a ReferenceToSources, the URI of a source could be specified with dcterms:references.

Our model, available at http://rdfs.org/sioc/wikitalk, then consists of:

– A class WikiDiscussionItem.
– Two subclasses of WikiDiscussionItem, named ReferenceItem and RequestItem, for references and requests respectively, which have further subclasses as follows:
  • For the ReferenceItem class:
    ◦ ReferenceToEdit;
    ◦ ReferenceToGuidelinesOrPolicies;
    ◦ ReferenceToInternalResources;
    ◦ ReferenceToRevertsOrControversialOrRemovedMaterial;
    ◦ ReferenceToSources;
    ◦ ReferenceToVandalism.
  • For the RequestItem class:
    ◦ RequestEditingCoordination;
    ◦ RequestHelpElsewhere;
    ◦ RequestInfo;
    ◦ RequestPeer-review.

Fig. 3. Annotated Talk page

5 Providing and using the annotations

5.1 RDFa Markup

Using this model, we then describe the type(s) of each comment, and the structural connections between these comments in MediaWiki Talk pages using RDFa markup. Here is an example before adding the markup (Listing 1.1), and after (Listing 1.2). The extracted RDF is also provided in Listing 1.3.
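
Before turning to the listings, the class hierarchy of the Section 4 model can be mirrored in a few lines of Python, which may help to see how a query for a general class (say, any request) would also pick up comments annotated with its subclasses. This is only an illustrative sketch: the class names come from the model above, but the dictionary layout, the helper names `descendants` and `comments_of_type`, and the sample comment records are hypothetical and are not part of the wikitalk module or of any tool described in this paper.

```python
# Hypothetical in-memory mirror of the wikitalk class hierarchy (Section 4).
# Only the class names are taken from the model; the structure is an assumption.
WIKITALK_SUBCLASSES = {
    "WikiDiscussionItem": ["ReferenceItem", "RequestItem"],
    "ReferenceItem": [
        "ReferenceToEdit",
        "ReferenceToGuidelinesOrPolicies",
        "ReferenceToInternalResources",
        "ReferenceToRevertsOrControversialOrRemovedMaterial",
        "ReferenceToSources",
        "ReferenceToVandalism",
    ],
    "RequestItem": [
        "RequestEditingCoordination",
        "RequestHelpElsewhere",
        "RequestInfo",
        "RequestPeer-review",
    ],
}


def descendants(cls):
    """Return cls followed by every class below it in the hierarchy."""
    found = [cls]
    for child in WIKITALK_SUBCLASSES.get(cls, []):
        found.extend(descendants(child))
    return found


def comments_of_type(comments, cls):
    """Select comments annotated with cls or with any of its subclasses.

    Each comment is a dict with a "types" list, because the
    classifications are not mutually exclusive.
    """
    wanted = set(descendants(cls))
    return [c for c in comments if wanted & set(c["types"])]


# Made-up sample records, for illustration only.
sample = [
    {"id": "post_1", "types": ["RequestEditingCoordination"]},
    {"id": "post_2", "types": ["ReferenceToVandalism", "ReferenceToEdit"]},
]
requests = comments_of_type(sample, "RequestItem")
```

In this sketch a comment may carry several types at once, matching the non-mutually-exclusive classifications of Section 3; asking for `RequestItem` selects only the first sample comment, while asking for `WikiDiscussionItem` selects both.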

    <h2>
      <span class="editsection">[<a href="/w/index.php?title=Talk:Semantic_Web&amp;action=edit&amp;section=2" title="Edit section: Opening sentence">edit</a>]</span>
      <span class="mw-headline" id="Opening_sentence">Opening sentence</span>
    </h2>
    Could somebody please put examples of 'semantic web' immediately after the opening sentence? Otherwise it just sounds a bit waffly and, more importantly, the intelligent lay reader is lost. Thanks. 86.42.96.251 (<a href="/wiki/User_talk:86.42.96.251" title="User talk:86.42.96.251">talk</a>) 10:38, 30 March 2009 (UTC)

Listing 1.1. Example of a comment in a Talk page

    <div xmlns:sioc="http://rdfs.org/sioc/ns#"
         xmlns:siocwt="http://rdfs.org/sioc/wikitalk#"
         xmlns:content="http://purl.org/rss/1.0/modules/content/"
         about="#Opening_sentence" typeof="sioc:Thread"
         rel="sioc:has_container" href="/w/index.php?title=Talk:Semantic_Web">
      <h2>
        <span class="editsection">[<a href="/w/index.php?title=Talk:Semantic_Web&amp;action=edit&amp;section=2" title="Edit section: Opening sentence">edit</a>]</span>
        <span class="mw-headline" id="Opening_sentence">Opening sentence</span>
      </h2>
      Could somebody please put examples of 'semantic web' immediately after the opening sentence? Otherwise it just sounds a bit waffly and, more importantly, the intelligent lay reader is lost. Thanks. 86.42.96.251 (<a href="/wiki/User_talk:86.42.96.251" title="User talk:86.42.96.251">talk</a>) 10:38, 30 March 2009 (UTC)
    </div>

Listing 1.2. Example of a comment in a Talk page, with RDFa markup

    <#post_1> a siocwt:RequestEditingCoordination ;
        content:encoded """Could somebody please put examples of 'semantic web' immediately after the opening sentence? Otherwise it just sounds a bit waffly and, more importantly, the intelligent lay reader is lost. Thanks. 86.42.96.251 (<a href="/wiki/User_talk:86.42.96.251" title="User talk:86.42.96.251">talk</a>) 10:38, 30 March 2009 (UTC)"""^^rdf:XMLLiteral ;
        sioc:has_container <#Opening_sentence> .

    <#Opening_sentence> a sioc:Thread ;
        sioc:has_container </w/index.php?title=Talk:Semantic_Web> .

Listing 1.3. Example of a comment in a Talk page, in Turtle (without prefixes)

5.2 Annotation and extraction tools

We are currently developing several services to provide and use the aforementioned annotations. First, we are creating two JavaScript plugins, an annotation plugin and a highlight plugin. Then, we will also investigate the use of SPARQL-based interfaces to query such annotations.

While editing the Talk page, an editor could use a JavaScript-based annotation plugin to specify which of the 10 classifications of our ontology apply. (Users do say that they are willing to choose the comment type.) The plugin would then generate the applicable RDFa markup. The annotation plugin could also get certain FOAF and SIOC attributes from the username or IP address.
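
The markup-generation step of such an annotation plugin might look roughly like the following Python sketch, which wraps an already-rendered comment in RDFa attributes for the classes an editor has chosen. Everything here is an assumption made for illustration: the function name `annotate_comment`, its parameters, and the exact element and attribute layout are hypothetical and are not taken from the plugin under development. The sketch also assumes that the `sioc:`, `siocwt:` and `content:` prefixes are declared on an enclosing element, as in Listing 1.2, and it is intended to express statements like those in Listing 1.3 rather than to reproduce any particular markup.

```python
from html import escape

# The ten leaf classes of the wikitalk module, as listed in Section 4.
WIKITALK_CLASSES = frozenset({
    "ReferenceToEdit",
    "ReferenceToGuidelinesOrPolicies",
    "ReferenceToInternalResources",
    "ReferenceToRevertsOrControversialOrRemovedMaterial",
    "ReferenceToSources",
    "ReferenceToVandalism",
    "RequestEditingCoordination",
    "RequestHelpElsewhere",
    "RequestInfo",
    "RequestPeer-review",
})


def annotate_comment(comment_html, post_id, thread_id, classes):
    """Wrap rendered comment HTML in RDFa attributes (hypothetical layout).

    comment_html is inserted unchanged, so it must already be
    well-formed markup. classes may name several wikitalk classes,
    since the classifications are not mutually exclusive.
    """
    unknown = sorted(set(classes) - WIKITALK_CLASSES)
    if not classes or unknown:
        raise ValueError(f"expected one or more wikitalk classes, got {list(classes)}")
    typeof = " ".join(f"siocwt:{name}" for name in classes)
    post = escape(post_id, quote=True)
    thread = escape(thread_id, quote=True)
    # The container link sits on an empty child element so that the
    # comment body below still describes the post, not the thread.
    return (
        f'<div about="#{post}" typeof="{typeof}">'
        f'<span rel="sioc:has_container" resource="#{thread}"></span>'
        f'<div property="content:encoded">{comment_html}</div>'
        f"</div>"
    )


markup = annotate_comment(
    "Could somebody please put examples of 'semantic web' here? Thanks.",
    "post_1",
    "Opening_sentence",
    ["RequestEditingCoordination"],
)
```

Rejecting class names outside the ten leaf classes keeps the generated `typeof` values within the vocabulary, which is one simple way a plugin could guard against typos in annotations.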
The annotation plugin will also facilitate user testing with the Wikipedia community, which may lead to further refinement of the Wikitalk module and its class labels, based on task-based evaluations with frequent wiki editors and other user testing of the annotation process.

So far we have created a plugin to use such annotations; relying on the RDFa markup, it uses a JavaScript RDFa parser (http://www.w3.org/2001/sw/BestPractices/HTML/rdfa-bookmarklet/) to parse a Talk page and to highlight relevant comments on a single Talk page, based on the ontology category to which they belong. We are currently evaluating this plugin and making improvements based on user feedback.

A third application, based on SPARQL, will allow querying to get “views” on top of MediaWiki pages. For example, the user could “find all references to vandalism posted in the last 2 days” or “find all comments mentioning a source outside Wikipedia”. SPARQL also opens up exciting possibilities, such as automatically collating comments, for instance transcluding Requests for Information into a more appropriate spot, such as (for Wikipedia) the Reference Desk for that topic, thus enabling new ways to automatically gather particular kinds of comments, and facilitating the coordination process in MediaWiki instances.

6 Conclusion

Talk pages, as we have seen, are highly used, making it challenging to find relevant comments. To help address this, we used a 15-item classification for MediaWiki Talk page comments, extended from Viégas, and then developed a new lightweight ontology extending SIOC to represent the relevant categories. We then enhanced MediaWiki Talk pages with RDFa markup to indicate comment types and structural elements. In ongoing and future work, that markup can be extracted with JavaScript and SPARQL, making the content of such pages easier to parse and to understand.

While the classifications in Tables 1 and 2 suit our immediate purpose, other alternatives are possible. Different classifications aiming towards a different ontology might focus more narrowly on the changes suggested (or indicated as made) by each comment (see, e.g. Table 3 in Stvilia [10]). Alternatively, an ontology dedicated to a particular wiki could be based on information quality dimensions and editorial policies specific to that wiki. As our work progresses, we will be guided by user evaluations, to discover which such approaches might be beneficial for editors collaborating in wiki spaces.

References

1. Nicolas Bencherki and Jeanne d’Arc Uwatowenimana. Writing a Wikipedia article: Data mining and organizational communication to explain the practices by which contributors maintain the article’s coherence. In Annual Meeting of the International Communication Association, Montreal, Quebec, May 2008.
2. John G. Breslin, Andreas Harth, Uldis Bojars, and Stefan Decker. Towards Semantically-Interlinked Online Communities. In The Semantic Web: Research and Applications, Proceedings of the 2nd European Semantic Web Conference (ESWC ’05), number 3532 in LNCS, pages 500–514. Heraklion, Greece, 2005.
3. Katherine Ehmann, Andrew Large, and Jamshid Beheshti. Collaboration in context: Comparing article evolution among subject disciplines in Wikipedia. First Monday, 13(10), October 2008.
4. Sean Hansen, Nicholas Berente, and Kalle Lyytinen. Wikipedia as rational discourse: An illustration of the emancipatory potential of information systems. In 40th Annual Hawaii International Conference on System Sciences, 2007.
5. Aniket Kittur and Robert E. Kraut. Beyond Wikipedia: Coordination and conflict in online production groups. In CSCW 2010. ACM, February 2010.
6. Travis Kriplean, Ivan Beschastnikh, David W. McDonald, and Scott A. Golder. Community, consensus, coercion, control: cs*w or how policy mediates mass participation. In Proceedings of the 2007 International ACM Conference on Supporting Group Work, pages 167–176, Sanibel Island, Florida, 2007. ACM.
7. Fabrizio Orlandi and Alexandre Passant. Enabling cross-wikis integration by extending the SIOC ontology. In Proceedings of the Fourth Semantic Wiki Workshop (SemWiki 2009), co-located with 6th European Semantic Web Conference (ESWC 2009), volume 464, Hersonissos, Heraklion, Crete, Greece, June 2009.
8. Christian Pentzold and Sebastian Seidenglanz. Foucault@Wiki: first steps towards a conceptual framework for the analysis of wiki discourses. In WikiSym ’06: Proceedings of the 2006 International Symposium on Wikis, 2006.
9. Jodi Schneider, Alexandre Passant, and John G. Breslin. A content analysis: How Wikipedia talk pages are used. In WebScience 2010, Raleigh, North Carolina, April 2010. http://websci10.org/.
10. Besiki Stvilia, Michael B. Twidale, Linda C. Smith, and Les Gasser. Information quality work organization in Wikipedia. Journal of the American Society for Information Science and Technology, 59(6):983–1001, 2008.
11. Fernanda B. Viégas, Martin Wattenberg, Jesse Kriss, and Frank van Ham. Talk before you type: Coordination in Wikipedia. In 40th Annual Hawaii International Conference on System Sciences, pages 78–87, 2007.
12. Howard T. Welser, Dan Cosley, Gueorgi Kossinets, Austin Lin, Fedor Dokshin, Geri Gay, and Marc Smith. Finding social roles in Wikipedia. In Proceedings of the American Sociological Association 2008, Boston, MA, 2008.