<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Semantic Representation of Provenance in Wikipedia</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Fabrizio</forename><surname>Orlandi</surname></persName>
							<email>fabrizio.orlandi@deri.org</email>
						</author>
						<author>
							<persName><forename type="first">Pierre-Antoine</forename><surname>Champin</surname></persName>
							<email>pchampin@liris.cnrs.fr</email>
						</author>
						<author>
							<persName><forename type="first">Alexandre</forename><surname>Passant</surname></persName>
							<email>alexandre.passant@deri.org</email>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="institution">Digital Enterprise Research Institute</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="institution">National University of Ireland</orgName>
								<address>
									<settlement>Galway Galway</settlement>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="laboratory">LIRIS</orgName>
								<orgName type="institution" key="instit1">Université de Lyon</orgName>
								<orgName type="institution" key="instit2">CNRS</orgName>
								<orgName type="institution" key="instit3">UMR5205</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<orgName type="institution">Université Claude Bernard Lyon</orgName>
								<address>
									<postCode>F-69622</postCode>
									<settlement>Villeurbanne</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff4">
								<orgName type="institution">Digital Enterprise Research Institute</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff5">
								<orgName type="institution">National University of Ireland</orgName>
								<address>
									<settlement>Galway Galway</settlement>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Semantic Representation of Provenance in Wikipedia</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">1FCF7C0005C2512672D1CF6BB58CDB4C</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T19:40+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Wikis are often regarded as a broad source of information. Identifying the provenance of their content is therefore crucial, whether for computing trust in public wiki pages or for identifying experts in corporate wikis. In this paper, we address this issue by providing a lightweight ontology for provenance management in wikis, based on the W7 model. Furthermore, we showcase the use of our model in a framework that computes provenance information in Wikipedia, also using DBpedia to compute provenance and contribution information per category, and not only per page.</p><p>This work is funded by the Science Foundation Ireland under grant number SFI/08/CE/I1380 (Líon 2) and by an IRCSET scholarship.</p><p>1 MediaWiki is the wiki engine that powers Wikipedia: www.mediawiki.org</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>From public encyclopedias to corporate knowledge management tools, wikis are often regarded as a broad source of information. Yet, since wikis generally offer an open publishing process where everyone can contribute, identifying provenance information in their pages is an important requirement. In particular, this information can be used to compute trust values for pages or page fragments <ref type="bibr" target="#b1">[2]</ref>, as well as to identify experts based on the number of contributions <ref type="bibr" target="#b8">[9]</ref> and on other criteria such as the users' social graphs <ref type="bibr" target="#b9">[10]</ref>. By providing this information as RDF <ref type="bibr" target="#b5">[6]</ref>, provenance metadata becomes more transparent and offers new opportunities for the previous use cases. It also lets people link to provenance information from other sources and personalize trust metrics based on the trust they have in a person regarding a particular topic <ref type="bibr" target="#b4">[5]</ref>.</p><p>This paper describes three of our contributions to address this issue and make provenance information in MediaWiki-powered wikis 1 available on the Semantic Web:</p><p>1) a lightweight ontology to represent provenance information in wikis, based on the W7 theory <ref type="bibr" target="#b12">[13]</ref> and using SIOC and its extensions;</p><p>2) a software architecture to extract and model provenance information about Wikipedia pages and categories, using the aforementioned ontology;</p><p>3) a user interface to make this information openly available on the Web, both to human and software agents and directly within Wikipedia pages.</p><p>In the next section, we discuss some related work in the realm of provenance management on the Semantic Web. Then, we give some background information regarding SIOC and the various extensions used in our work. In Section IV, we present the W7 theory and the lightweight ontology we have built to represent it in RDFS. We then describe our software architecture and how we compute provenance information in Wikipedia, and finally present the user interface to access this information, before concluding the paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORK</head><p>The representation and extraction of provenance information is not a recent research topic. Many studies have addressed the representation of data provenance <ref type="bibr" target="#b14">[15]</ref>, but few of them have focused on integrating provenance information into the Web of data <ref type="bibr" target="#b5">[6]</ref>. Providing this information as RDF would make provenance metadata more transparent and interlinked with other sources, and it would also enable new scenarios for evaluating trust and data quality on top of it. In this regard, a W3C Provenance Incubator Group<ref type="foot" target="#foot_0">2</ref> has recently been established. The mission of the group is to "provide a state-of-the art understanding and develop a roadmap in the area of provenance for Semantic Web technologies, development, and possible standardization". Requirements for provenance on the Web<ref type="foot" target="#foot_1">3</ref>, as well as several use cases and technical requirements, have been provided by the working group. A comprehensive analysis of approaches and methodologies for publishing and consuming provenance metadata on the Web is presented in <ref type="bibr" target="#b6">[7]</ref>.</p><p>Another research topic relevant to our work is the evaluation of trust and data quality in wikis. Recent studies proposed several algorithms for wikis that automatically calculate users' contributions and evaluate their quantity and quality in order to study the authors' behavior, produce trust measures of the articles and find experts. WikiTrust <ref type="bibr" target="#b1">[2]</ref> is a project aimed at measuring the quality of author contributions on Wikipedia. Its authors developed a tool that computes the origin and author of every word on a wiki page, as well as "a measure of text trust that indicates the extent with which text has been revised" <ref type="foot" target="#foot_2">4</ref>. On the same topic, other researchers tried to solve the problem of evaluating articles' quality, not only by examining the users' history quantitatively <ref type="bibr" target="#b8">[9]</ref>, but also by using social network analysis techniques <ref type="bibr" target="#b9">[10]</ref>.</p><p>From our perspective, there is a need to publish provenance information as Linked Data from websites hosting a broad source of information (such as Wikipedia). Yet, most of the work on provenance of data is either not focused on integrating the generated information into the Web of data, or mainly based on provenance for resource descriptions or already structured data. On the other hand, the interesting work done so far on analyzing trust and quality in wikis does not take into account the importance of making the extracted information available on the Web of data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. BACKGROUND</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Using SIOC for wiki modelling</head><p>The SIOC Ontology (Semantically-Interlinked Online Communities) <ref type="bibr" target="#b0">[1]</ref> provides a model for representing online communities and their contributions <ref type="foot" target="#foot_3">5</ref>. It is mainly centered around the concepts of users, items and containers, so it can be used to model content created by a particular user on several platforms, enabling a distributed perspective on the management of User-Generated Content on the Web. In particular, the atomic elements of the Web applications described by SIOC are called Items. They are grouped in Containers, which can themselves be contained in other Containers. Finally, every Container belongs to a Space. As an example, a Site (subclass of Space) may contain a number of Wikis (subclass of Container) and every Wiki contains a set of WikiArticles (subclass of Item) generated by UserAccounts. For more details about SIOC, we invite the reader to consult the W3C Member Submission <ref type="bibr" target="#b0">[1]</ref> and its online specification <ref type="foot" target="#foot_4">6</ref>.</p><p>While the SIOC Types module provides several subclasses of Container and Item, including Wiki and WikiArticle, some characteristics of wikis required further modelling. Hence, in our previous work <ref type="bibr" target="#b10">[11]</ref> we extended the SIOC Ontology to take into account such characteristics (e.g. multi-authoring, versioning, etc.). Some tools to generate and consume data from wikis using our model have also been developed <ref type="bibr" target="#b11">[12]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. The SIOC Actions module</head><p>While SIOC represents the state of a community at a given time, SIOC-actions <ref type="bibr" target="#b3">[4]</ref> can be used to represent its dynamics, i.e. how it evolves. Hence, SIOC provides a document-centric view of online communities, while SIOC-actions focuses on an action-centric view. More precisely, the evolution of an online community is represented as a set of actions, each performed by a user (sioc:UserAccount) at some time and impacting a number of objects (sioc:Item). SIOC-actions provides an extensible hierarchy of properties for representing the effect of an action on its items, such as creates, modifies, uses, etc. Besides the SIOC ontology, SIOC-actions relies on the vocabulary for Linking Open Descriptions of Events (LODE) <ref type="foot" target="#foot_5">7</ref>. The core of the module is the Action class (subclass of event:Event from the Event Ontology), which is a timestamped event involving an agent (e.g. a UserAccount) and a number of digital artifacts (e.g. Items). For more details about SIOC Actions and its use in our model, see Section IV.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. REPRESENTING THE W7 MODEL USING RDFS/OWL</head><p>The W7 model is an ontological model created to describe the semantics of data provenance <ref type="bibr" target="#b12">[13]</ref>. It is a conceptual model and, to the best of our knowledge, an RDFS/OWL representation of it has not been implemented yet. Hence we focus on an implementation of this model for the specific context of wikis. As a comparison, in <ref type="bibr" target="#b13">[14]</ref> the authors use the example of Wikipedia to illustrate theoretically how their proposed W7 model can capture domain- or application-specific provenance.</p><p>The W7 model is based on Bunge's ontology <ref type="bibr" target="#b2">[3]</ref> and is built on the concept of tracking the history of the events affecting the status of things during their life cycle. In this particular case we consider the data life cycle. Bunge's ontology, developed in 1977, is considered one of the main sources of constructs to model real systems and information systems. Since Bunge's work is theoretical, there has been some effort from the scientific community to translate it into machine-readable ontologies <ref type="foot" target="#foot_6">8</ref>.</p><p>The W7 model represents data provenance using seven fundamental elements or interrogative words: what, when, where, how, who, which, and why. It has been purposely built on general and extensible principles, hence it is possible to capture provenance semantics for data in different domains. We refer to <ref type="bibr" target="#b12">[13]</ref> for a detailed description of the mappings between the W7 and Bunge's models, and in Table <ref type="table">I</ref> we provide a summary of the W7 elements (as in <ref type="bibr" target="#b13">[14]</ref>). The structure of the W7 model makes clear why we chose the SIOC Actions module as the core of our model: most of the concepts in the Actions module are the same as in the W7 model. Furthermore, wikis are community sites and the Actions module has been designed to represent dynamic, action-centric views of online communities.</p><p>In the following sections we give a detailed description of how we answered each of these seven questions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. What</head><p>The What element represents an event that affected data during its life cycle. It is a change of state and the core of the model. In this regard, there are three main events affecting data: creation, modification and deletion. In the context of wikis, each of them can appear: users can (1) add new sentences (or characters), (2)  in the same position of the article. In addition, in systems like Wikipedia, some other specific events can affect the data on the wiki, for example "quality assessment" or "change in access rights" of an article <ref type="bibr" target="#b13">[14]</ref>; however, they can be expressed with the three broader types defined above.</p><p>Since (1) wikis commonly provide a versioning mechanism for their content and (2) every action on a wiki article leads to the generation of a new article revision, the core event describing our What element is the creation of an article version. In particular we model this creation, and the related modification of the latest version (i.e. the permalink), using the SIOC-Actions model as shown in Listing 1.</p><p>&lt;http://example.com/action?title=Dublin_Core#380106133&gt;
    sioca:creates &lt;http://en.wikipedia.org/w/index.php?title=Dublin_Core&amp;oldid=380106133&gt; ;
    sioca:modifies &lt;http://en.wikipedia.org/wiki/Dublin_Core&gt; ;
    a sioca:Action .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Listing 1. Representing the "What" element</head><p>As we can see from the example above, expressed in Turtle syntax, we have a sioca:Action identified by the URI http://example.com/action?title=Dublin_Core#380106133 that leads to the creation of a revision of the main wiki article about "Dublin Core". The creation of a new revision was originated by a modification (sioca:modifies) of the main Wikipedia article http://en.wikipedia.org/wiki/Dublin_Core. Details about the type of event are presented in the next section about the How element, where we identify the type of action involved in the event creation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. How</head><p>The How element in W7 is equivalent to the Action element from Bunge's ontology, and describes the action leading to an event. In wikis, the possible actions leading to an event (i.e. the creation of a new revision) are all the edits applied to a specific article revision. By analyzing the diff between two subsequent revisions of a page, we can identify the type of action involved in the creation of the newer revision. In particular we focus on modelling the following types of edits: Insertion, Update and Deletion of both Sentences and References. With the term Sentence we refer to every sequence of characters that does not include a reference or a link to another source, and with Reference we refer to every action that involves a link or a so-called Wikipedia reference. As discussed in <ref type="bibr" target="#b13">[14]</ref>, another type of edit would be a Revert, i.e. an undo of the effects of one or more previous edits. However, in Wikipedia, a revert does not restore a previous version of the article, but creates a new version with content similar to that of an earlier selected version. We therefore decided to model a revert like any other edit, and not as a particular pattern. A revert can still be distinguished from other types of action, with an acceptable level of precision, by looking at the user comment entered when doing the revert, since most users add a related revert comment <ref type="foot" target="#foot_7">9</ref>.</p><p>Going further, and to represent provenance data for the action involved in each wiki edit, we modelled the diffs appearing between pages. To model the differences calculated between subsequent revisions we created a lightweight Diff ontology, inspired by the Changeset vocabulary <ref type="foot" target="#foot_8">10</ref>. Yet, instead of describing changes to RDF statements, our model aims at describing changes to plain text documents. It provides a main class, the diff:Diff class, and six subclasses: SentenceUpdate, SentenceInsertion, SentenceDeletion and ReferenceUpdate, ReferenceInsertion, ReferenceDeletion, based on the previous How patterns. The main Diff class represents all information about the change between two versions of a wiki page (see Fig. <ref type="figure" target="#fig_0">1</ref>). The Diff's properties subjectOfChange and objectOfChange point respectively to the version changed by this diff and to the newly created version. Details about the time and the creator of the change are provided respectively by dc:created and sioc:has_creator. Moreover, the comment about the change is provided by the diff:comment property with range rdfs:Literal. In Figure <ref type="figure" target="#fig_0">1</ref> we also display a Diff class linking to another Diff class. The latter represents one of the six Diff subclasses described earlier in this section. Since a single diff between two versions can be composed of several atomic changes (or "sub-diffs"), a Diff class can point to several subclasses using the dc:hasPart property. Each Diff subclass can have at most one TextBlock removed and one added: if it has both, then the type of change is an Update, otherwise the type is an Insertion or a Deletion.</p><p>The TextBlock class is part of the Diff ontology and represents a sequence of characters added or removed in a specific position of a plain text document. It exposes the content of this sequence of characters (content) and a pointer to its position inside the document (lineNumber). Note that the document content is usually organized in sets of lines, as in wiki articles, but this class is generic enough to be reusable with other types of text organization. Note also that each of the six subclasses of the Diff class inherits the properties defined for the parent class; this is not displayed in Figure <ref type="figure" target="#fig_0">1</ref> for space reasons.</p><p>With the model presented it is possible to address an important requirement for provenance: the reproducibility of a process. Starting from an older revision of a wiki article, and following the diffs between the newer revisions together with the TextBlocks added or removed, it is possible to reconstruct the latest version of the article. This approach goes a step further than just storing the different data versions: it provides details of the entire process involved in the data life cycle.</p></div>
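<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Update/Insertion/Deletion rule and the replay of diffs described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PHP implementation used in this work: the class names AtomicChange and TextBlock, the 0-based line numbers, the link markers used to tell a Reference from a Sentence, and the choice to apply changes in order against the intermediate document are all hypothetical.</p>
<p>
```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextBlock:
    """Hypothetical stand-in for diff:TextBlock."""
    content: str       # the characters added or removed
    line_number: int   # 0-based position in the document (an assumption)


@dataclass
class AtomicChange:
    """Hypothetical stand-in for one of the six Diff subclasses."""
    removed: Optional[TextBlock] = None
    added: Optional[TextBlock] = None


# Assumed heuristic: a block is a Reference if it carries a link marker.
LINK_MARKERS = ("[[", "http://", "https://")


def classify(change: AtomicChange) -> str:
    """Name the Diff subclass: both blocks means Update, else Insertion or Deletion."""
    blocks = [b for b in (change.removed, change.added) if b is not None]
    if not blocks:
        raise ValueError("a change needs at least one TextBlock")
    is_ref = any(m in b.content for b in blocks for m in LINK_MARKERS)
    target = "Reference" if is_ref else "Sentence"
    if change.removed is not None and change.added is not None:
        kind = "Update"
    elif change.added is not None:
        kind = "Insertion"
    else:
        kind = "Deletion"
    return target + kind


def replay(lines: List[str], changes: List[AtomicChange]) -> List[str]:
    """Apply atomic changes in order to rebuild the newer revision."""
    result = list(lines)
    for change in changes:
        if change.removed is not None:
            del result[change.removed.line_number]
        if change.added is not None:
            result.insert(change.added.line_number, change.added.content)
    return result


old = ["Dublin Core is a metadata standard.", "It has 15 elements."]
changes = [
    AtomicChange(removed=TextBlock("It has 15 elements.", 1),
                 added=TextBlock("It defines fifteen elements.", 1)),
    AtomicChange(added=TextBlock("See [[Metadata]].", 2)),
]
new = replay(old, changes)
kinds = [classify(c) for c in changes]
```
</p><p>In this sketch the first change is classified as a SentenceUpdate and the second as a ReferenceInsertion, and replaying both changes on the two-line older revision yields the three-line newer revision.</p></div>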
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. When</head><p>The When element in W7 is equivalent to the Time element from Bunge's ontology, and refers to the time an event occurs, which every wiki platform records for page edits. As depicted in Figure <ref type="figure" target="#fig_0">1</ref>, each Diff class is linked to the timestamp of the event using the dc:created property. The same timestamp is also linked to each Diff subclass using the same property (not shown in Fig. <ref type="figure" target="#fig_0">1</ref> for space reasons). The time of the event is modelled in more detail in the Action element, as shown in the following Listing 2 11.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Listing 2. Representing the "When" element in Turtle syntax</head><p>In this context we consider actions to be instantaneous. As in <ref type="bibr" target="#b3">[4]</ref>, we track the instant at which an action takes effect on a wiki (i.e. when a wiki page is saved). Usually, this creation time is represented using dc:created. Another option, provided by the LODE ontology, uses the lode:atTime property to link to a class representing a time interval or an instant.</p><p>11 For all the namespaces see: http://prefix.cc</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Where</head><p>The Where element represents the online "Space" or the location associated with an event. In wikis, and in particular in Wikipedia, this is one of the most controversial elements of the W7 model. If the location of an article update is taken to be the location of the user when updating the content, then this information is neither complete nor accurate on Wikipedia. Indeed, we can extract it only from the IP address of anonymous users, not for all Wikipedia users. Note that it is possible to link a sioc:UserAccount (e.g. http://en.wikipedia.org/wiki/User:96.245.230.136) to the related IP address using the SIOC ip_address property.</p></div>
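<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal sketch of the idea above, the following Python snippet checks whether a Wikipedia user name is an IP literal (i.e. an anonymous editor) and, if so, returns the triple linking the account to its address. The function names and the tuple representation of the triple are assumptions made for illustration; only the account URI pattern and the sioc:ip_address property come from the text.</p>
<p>
```python
import ipaddress
from typing import Optional, Tuple


def anonymous_ip(username: str) -> Optional[str]:
    """Return the IP address if the user name is an IP literal, else None."""
    try:
        return str(ipaddress.ip_address(username.strip()))
    except ValueError:
        return None


def ip_triple(username: str) -> Optional[Tuple[str, str, str]]:
    """Build a (subject, predicate, object) triple for anonymous accounts only."""
    ip = anonymous_ip(username)
    if ip is None:
        return None  # registered user: no location hint is available
    account = "http://en.wikipedia.org/wiki/User:" + username.strip()
    return (account, "sioc:ip_address", ip)


anon = ip_triple("96.245.230.136")
registered = ip_triple("KingsleyIdehen")
```
</p><p>Registered accounts yield no triple here, which mirrors the limitation discussed above: the Where element can only be approximated for anonymous edits.</p></div>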
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Who</head><p>The Who element describes an agent involved in an event, and can therefore be a person or an organization. On a wiki it represents the editor of a page, who can be either a registered user or an anonymous user. A registered user might also have different roles on the Wikipedia site and, on this basis, different permissions are granted to the account. In this work we only keep track of the user account involved in each event, not of its role on the wiki. Users are modelled with the sioc:UserAccount class and linked to each sioca:Action, sioct:WikiArticle and diff:Diff with the property sioc:has_creator. A sioc:UserAccount represents a user account, in an online community site, owned by a physical person, a group or an organization (i.e. a foaf:Agent). Hence a physical person, represented by a foaf:Person (subclass of foaf:Agent), can be linked to several sioc:UserAccount instances.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F. Which</head><p>The Which element represents the programs or the instruments used in the event. In our particular case it is the software used in editing the event, which might be a bot or the wiki software used by the editor. Since there is not a direct and precise way to identify whether the edit has been made by a human or a bot, our model does not make this distinction. A naive method could be to look at the username and check if it contains the "bot" string.</p></div>
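<div xmlns="http://www.tei-c.org/ns/1.0"><p>The naive check mentioned above can be written as follows. This is a sketch only: the function name and the sample user names are hypothetical, and the second example shows why the heuristic is imprecise.</p>
<p>
```python
def looks_like_bot(username: str) -> bool:
    """Naive heuristic: flag any user name containing the string "bot"."""
    return "bot" in username.lower()


# Hypothetical user names, chosen to show a hit, a false positive and a miss.
flags = {name: looks_like_bot(name)
         for name in ("ExampleBot", "Abbott", "KingsleyIdehen")}
```
</p><p>The name "Abbott" is flagged although nothing suggests it is automated, which illustrates why the model does not rely on this distinction.</p></div>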
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G. Why</head><p>The Why element represents the reasons behind the occurrence of the event. On Wikipedia it is given by the justification for a change that a user enters in the "comment" field. This field is not mandatory when editing a wiki page, but the Wikipedia guidelines recommend filling it in. We model the comment left by the user with a property diff:comment linking the diff:Diff class to the related rdfs:Literal.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. APPLICATION USING PROVENANCE DATA FROM WIKIPEDIA</head><p>A. Collecting the data from the Web</p><p>In order to validate and test our modelling solution for provenance on wikis, and in particular on Wikipedia, we collected data from the English Wikipedia and the DBpedia service. The DBpedia project 12, since it extracts and publishes structured information from the English Wikipedia, is considered as its RDF export. Collecting data not only from Wikipedia but also from DBpedia has an important advantage: it directly provides us with structured data modelled in RDF with popular, standard lightweight ontologies. We use the DBpedia data especially for the categories that hierarchically structure the articles on Wikipedia. We ran our experiment on a portion of the Wikipedia articles, namely the articles belonging to the whole hierarchy under a given category. By doing this we could limit our dataset to articles strongly related to each other, and collect a user community with a common interest.</p><p>A PHP script has been developed to extract all the articles belonging to a category and all its subcategories, and for each article its whole revision history. In more detail, this program:</p><p>• Executes a SPARQL 13 query over the DBpedia endpoint to get the category hierarchy;</p><p>• Stores the category hierarchy (modelled with the SKOS 14 vocabulary) in a local triplestore;</p><p>• Queries the DBpedia endpoint again to get all the articles belonging to the categories collected;</p><p>• For all the articles collected, generates (and stores locally) RDF data using the SIOC-MediaWiki exporter 15;</p><p>• Using the sioc:previous_version property, exports RDF for all the previous revisions of each article.</p><p>The advantage of using DBpedia in this process is clear, since we collected structured data by executing just two lightweight SPARQL queries.</p><p>A second PHP script has been developed to extract detailed provenance information from the articles collected in the previous step. This script calculates the diff between consecutive versions of the articles, and retrieves further related information from the Wikipedia API. The data retrieved from the API comprises all the information needed to instantiate the model described in the previous section: the editor, the timestamp, the comment and the ID of each version. Moreover, the algorithm is capable not only of extracting the diff, but also of computing the type of change for each of the differences identified. This allows us to mark each change with one of the Sentence or Reference Insertion/Update/Deletion subclasses of the diff:Diff class. Finally, the script generates RDF data with the model described before and inserts it in the local triplestore. In order to test our application we ran the data extraction algorithm starting from the category "Semantic Web" on the English Wikipedia, and we generated data for all the 166 wiki articles belonging to this category and its subcategories recursively.</p><p>As we can see, using Semantic Web technologies, we have the advantage of having a single and standard language to query wiki and provenance data together, while developers that need to query the original systems have to learn a new API for each new system they want to query.</p><p>12 http://dbpedia.org</p><p>13 Query Language for RDF: http://www.w3.org/TR/rdf-sparql-query/</p><p>14 SKOS Reference: http://www.w3.org/TR/skos-reference/</p><p>15 http://ws.sioc-project.org/mediawiki/</p></div>
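<div xmlns="http://www.tei-c.org/ns/1.0"><p>The first steps of the collection procedure above (walking the category hierarchy, then gathering the articles of every category found) can be sketched as follows. This is not the PHP script used in this work: the two dictionaries stand in for the results of the two SPARQL queries, and all category and article names in them, as well as the function names, are hypothetical.</p>
<p>
```python
from collections import deque
from typing import Dict, List, Set


def collect_categories(root: str, narrower: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first walk of the hierarchy under root, visiting each category once."""
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in narrower.get(current, []):
            if child not in seen:   # skip categories already visited
                seen.add(child)
                queue.append(child)
    return seen


def collect_articles(categories: Set[str],
                     members: Dict[str, List[str]]) -> Set[str]:
    """Union of the articles belonging to any of the given categories."""
    articles: Set[str] = set()
    for category in categories:
        articles.update(members.get(category, []))
    return articles


# Hypothetical sample data standing in for the SPARQL query results.
narrower = {
    "Semantic_Web": ["Ontology", "Linked_data"],
    "Ontology": ["Semantic_Web"],   # a loop, to show the walk terminates
}
members = {
    "Semantic_Web": ["SIOC", "Dublin_Core"],
    "Linked_data": ["DBpedia", "SIOC"],
}
categories = collect_categories("Semantic_Web", narrower)
articles = collect_articles(categories, members)
```
</p><p>Tracking the visited categories keeps the walk finite even if the hierarchy contains a loop, and using sets means an article listed under several categories is collected only once.</p></div>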
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. A Firefox plug-in for provenance from Wikipedia</head><p>In order to show the potential of the data collected and the data model created, we built an application to show some interesting statistics extracted from provenance information of the analyzed articles. The application displays a table directly on the top of each Wikipedia article exposing some information about the most active users on the article and their edits. In particular this has been developed using a Greasemonkey<ref type="foot" target="#foot_9">16</ref> script: a Mozilla Firefox extension that allows users to install scripts that make on-the-fly changes to HTML web page content. This script is developed in JavaScript language and is now compatible with other popular Web browsers. The structure of the application is then composed by the following elements: 1) The triplestore containing the data collected and exposing a SPARQL endpoint for querying the data; 2) A PHP script, used as an interface between the Greasemonkey script and the triplestore; 3) A Greasemonkey script, which retrieves the URL of the Wikipedia loaded page, sends the request to the PHP script and then displays the returned HTML data on the Wikipedia page. The PHP script in this application is important because it is responsible for executing the SPARQL queries on the triplestore. Furthermore it retrieves the results and creates the HTML code to embed on the Wikipedia page. A screenshot of the result of the process is displayed in Figure <ref type="figure">3</ref>.</p><p>The tables displayed in Figure <ref type="figure">3</ref> appear only on the top of the Wikipedia articles and categories that we analyzed with the method described in Section V-A. A different type of table is showed when the page visited is a category page. In Figure <ref type="figure">3</ref> on the top table, we can see the top six users who did the biggest number of edits on the article. 
For each of these users we then compute: (1) their total number of edits on the page;</p><p>(2) their percentage of "ownership" on the page (or better, the percentage of their edits compared to all the edits done on the article); (3) their number of lines added on the article; (4) their number of lines removed on the article; (5) their total number of lines added and removed on all the articles belonging to the category "Semantic Web". With the other use-case, when the user visits a Wikipedia category page, we display different Fig. <ref type="figure">3</ref>. A screenshot of the application on the "Linked Data" page and the table from the Category "Semantic Web" page types of information but using the same method. See the table on the bottom in Figure <ref type="figure">3</ref>. Browsing a wiki category page, the application shows a list of the users with the biggest number of edits on the articles of the whole category (and related subcategories). It also shows the related percentages of their edits compared to the total edits on the category. The second table on the right exposes a list of the most edited articles in the category during the last three months. To note also that at the bottom of each table there is a link pointing to a page where a longer list of results will be displayed. At the moment the PHP script developed is available at http: //vmuss06.deri.ie/WikiProvenance/index.php. Just using this script is possible to have the same information displayed using the Greasemonkey script and also to have the RDF descriptions of the page requested. In order to represent these statistical information in RDF, we use SCOVO, the Statistical Core Vocabulary <ref type="bibr" target="#b7">[8]</ref>. It relies on the concept of Item and dimensions to represent statistical information. In our context, the item is one piece of statistical information (e.g. 
user "X" edited 10 lines on page "Y"), and various dimensions are involved in its description: (1) the type of information that we want to represent (number of edits, percentage, lines added and removed, etc.); (2) the page or the category concerned; (3) the user involved. Hence, we created four instances of scv:Dimension to represent the first dimension, and simply relied on the scv:dimension property for the other ones. As an example, the following snippet represents that the user KingsleyIdehen made 11 edits on the SIOC page.</p><p>The goal of this paper was to provide a solution for representing and managing the provenance of data from Wikipedia (and other wikis) using Semantic Web technologies. To this end we provided: a specific lightweight ontology for provenance in wikis, based on the W7 model; a framework for extracting provenance data from Wikipedia; and an application for accessing the generated data in a meaningful way and exposing it to the Web of Data. We showed that the W7 model is a good choice for modelling provenance information, both in general and in wikis, but that, because of its high level of abstraction, it has to be refined, for instance with other specific lightweight ontologies. In our case this was done using SIOC and its Actions module. Future work will include a refinement of the proposed model and its subsequent alignment with other general-purpose ontologies for representing provenance as Linked Data (e.g. the Open Provenance Model). We also plan to improve and extend our application by offering more features and by providing a wider range of data, with an architecture that automatically updates the data as soon as it changes on Wikipedia.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Modeling differences in plain text documents with the Diff vocabulary</figDesc></figure>
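<div xmlns="http://www.tei-c.org/ns/1.0"><p>The SCOVO modelling described in the application section (one scv:Item whose dimensions are the type of information, the page and the user) can be sketched in Turtle as follows. This is only an illustrative sketch, not the exact output of the application: the ex: namespace, the identifiers ex:item1 and ex:Edits, and the page and user URIs are assumptions chosen for the example. Only scv:Item, scv:Dimension, scv:dimension and rdf:value are taken from the SCOVO and RDF vocabularies.</p><p>
```turtle
@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix scv: &lt;http://purl.org/NET/scovo#&gt; .
# Hypothetical namespace, used only for this illustration.
@prefix ex:  &lt;http://example.org/wikiprovenance/&gt; .

# One of the four scv:Dimension instances: the type of information
# (here, the number of edits). The name ex:Edits is an assumption.
ex:Edits a scv:Dimension .

# One piece of statistical information:
# the user KingsleyIdehen made 11 edits on the SIOC page.
ex:item1 a scv:Item ;
    rdf:value 11 ;
    # (1) the type of information
    scv:dimension ex:Edits ;
    # (2) the page concerned (assumed URI)
    scv:dimension &lt;http://en.wikipedia.org/wiki/SIOC&gt; ;
    # (3) the user involved (assumed URI)
    scv:dimension &lt;http://en.wikipedia.org/wiki/User:KingsleyIdehen&gt; .
```
</p><p>In this sketch the page and the user are attached directly through scv:dimension without being declared as scv:Dimension instances, which mirrors the choice of creating explicit dimension instances only for the type of information.</p></div>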
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Modeling the Who element with sioc:UserAccount</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="6,48.96,54.00,256.20,206.85" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>TABLE I</head><label>I</label><figDesc>Definition of the 7 Ws by Ram S. and Liu J.</figDesc><table><row><cell>Provenance element</cell><cell>Construct in Bunge's ontology</cell><cell>Definition</cell></row><row><cell>What</cell><cell>Event</cell><cell>An event (i.e. change of state) that happens to data during its life time</cell></row><row><cell>How</cell><cell>Action</cell><cell>An action leading to the events. An event may occur, when it is acted upon by another thing, which is often a human or a software agent</cell></row><row><cell>When</cell><cell>Time</cell><cell>Time or more accurately the duration of an event</cell></row><row><cell>Where</cell><cell>Space</cell><cell>Locations associated with an event</cell></row><row><cell>Who</cell><cell>Agent</cell><cell>Agents including persons or organizations involved in an event</cell></row><row><cell>Which</cell><cell>Agent</cell><cell>Instruments or software programs used in the event</cell></row><row><cell>Why</cell><cell>-</cell><cell>Reasons that explain why an event occurred</cell></row></table><note>remove sequences of characters, or (3) modify characters by removing and then adding content</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc></figDesc><table><row><cell>&lt;http://example.com/action?title=Dublin_Core#380106133&gt;</cell></row><row><cell>dc:created "2010-08-21T06:36:17Z"^^&lt;http://www.w3.org/2001/XMLSchema#dateTime&gt; ;</cell></row><row><cell>lode:atTime [</cell></row><row><cell>a time:Instant ;</cell></row><row><cell>time:inXSDDateTime "2010-08-21T06:36:17Z"^^&lt;http://www.w3.org/2001/XMLSchema#dateTime&gt;</cell></row><row><cell>] ;</cell></row><row><cell>a sioca:Action .</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">established in September 2009. http://www.w3.org/2005/Incubator/prov/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">http://www.w3.org/2005/Incubator/prov/wiki/User_Requirements</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">WikiTrust: http://wikitrust.soe.ucsc.edu/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">http://sioc-project.org</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">http://rdfs.org/sioc/spec/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_5">LODE Ontology specification: http://linkedevents.org/ontology/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_6">Evermann J. provides an OWL description of Bunge's ontology at: http://homepages.mcs.vuw.ac.nz/~jevermann/Bunge/v5/index.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_7">Note that we could also compare the n-1 and n+1 versions of each page to identify whether a change is a revert.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_8">The Changeset schema: http://purl.org/vocab/changeset/schema#</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="16" xml:id="foot_9">http://www.greasespot.net/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">SIOC Core Ontology Specification</title>
		<ptr target="http://www.w3.org/Submission/sioc-spec/" />
	</analytic>
	<monogr>
		<title level="m">W3C Member Submission</title>
				<imprint>
			<publisher>World Wide Web Consortium</publisher>
			<date type="published" when="2007-06-12">12 June 2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Measuring author contributions to the Wikipedia</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">T</forename><surname>Adler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>De Alfaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Pye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vishwanath</forename><surname>Raman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of WikiSym &apos;08</title>
				<meeting>WikiSym &apos;08</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Treatise on Basic Philosophy: Ontology I: The Furniture of the World</title>
		<author>
			<persName><forename type="first">Mario</forename><surname>Bunge</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1977">1977</date>
			<publisher>Reidel</publisher>
			<pubPlace>Boston</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">SIOC in Action - Representing the Dynamics of Online Communities</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">A</forename><surname>Champin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Passant</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th International Conference on Semantic Systems (I-SEMANTICS 2010)</title>
				<meeting>the 6th International Conference on Semantic Systems (I-SEMANTICS 2010)</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Trust networks on the semantic web</title>
		<author>
			<persName><forename type="first">J</forename><surname>Golbeck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Parsia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hendler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Cooperative Information Agents VII</title>
				<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="238" to="249" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Provenance information in the web of data</title>
		<author>
			<persName><forename type="first">Olaf</forename><surname>Hartig</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2nd Workshop on Linked Data on the Web (LDOW 2009) at WWW</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Publishing and Consuming Provenance Metadata on the Web of Linked Data</title>
		<author>
			<persName><forename type="first">Olaf</forename><surname>Hartig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun</forename><surname>Zhao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of 3rd Int. Provenance and Annotation Workshop</title>
				<meeting>3rd Int. Provenance and Annotation Workshop</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">SCOVO: Using statistics on the Web of data</title>
		<author>
			<persName><surname>Hausenblas</surname></persName>
		</author>
		<author>
			<persName><surname>Halb</surname></persName>
		</author>
		<author>
			<persName><surname>Raimond</surname></persName>
		</author>
		<author>
			<persName><surname>Feigenbaum</surname></persName>
		</author>
		<author>
			<persName><surname>Ayers</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Semantic Web in Use Track of the 6th European Semantic Web Conference (ESWC2009)</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Social Rewarding in Wiki Systems-Motivating the Community</title>
		<author>
			<persName><surname>Hoisl</surname></persName>
		</author>
		<author>
			<persName><surname>Aigner</surname></persName>
		</author>
		<author>
			<persName><surname>Miksch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd international conference on Online communities and social computing</title>
				<meeting>the 2nd international conference on Online communities and social computing</meeting>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="362" to="371" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Evaluating authoritative sources using social networks: an insight from Wikipedia</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">T</forename><surname>Korfiatis</surname></persName>
		</author>
		<author>
			<persName><surname>Poulos</surname></persName>
		</author>
		<author>
			<persName><surname>Bokos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Online Information Review</title>
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Enabling cross-wikis integration by extending the SIOC ontology</title>
		<author>
			<persName><forename type="first">Fabrizio</forename><surname>Orlandi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alexandre</forename><surname>Passant</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">4th Semantic Wiki Workshop (SemWiki 2009)</title>
				<meeting>SemWiki 2009</meeting>
		<imprint>
			<publisher>CEUR-WS</publisher>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Semantic Search on Heterogeneous Wiki Systems</title>
		<author>
			<persName><forename type="first">Fabrizio</forename><surname>Orlandi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alexandre</forename><surname>Passant</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Symposium on Wikis (WikiSym 2010)</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Understanding the semantics of data provenance to support active conceptual modeling</title>
		<author>
			<persName><forename type="first">Sudha</forename><surname>Ram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun</forename><surname>Liu</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2007">2007</date>
			<publisher>Springer</publisher>
			<biblScope unit="page" from="17" to="29" />
			<pubPlace>Berlin / Heidelberg</pubPlace>
		</imprint>
	</monogr>
	<note>LNCS edition</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">A New Perspective on Semantics of Data Provenance</title>
		<author>
			<persName><forename type="first">Sudha</forename><surname>Ram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">First International Workshop on the role of Semantic Web in Provenance Management (SWPM 2009)</title>
				<meeting>SWPM 2009</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">A survey of data provenance techniques</title>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">L</forename><surname>Simmhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Plale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Gannon</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2005">2005</date>
			<pubPlace>Bloomington, IN 47405</pubPlace>
		</imprint>
		<respStmt>
			<orgName>Computer Science Department, Indiana University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
