<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Representation of Provenance in Wikipedia</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Fabrizio Orlandi Pierre-Antoine Champin Alexandre Passant Digital Enterprise Research Institute LIRIS, Université de Lyon, CNRS, UMR5205 Digital Enterprise Research Institute National University of Ireland, Galway Université Claude Bernard Lyon 1, F-69622 National University of Ireland</institution>
          ,
          <addr-line>Galway Galway, Ireland Villeurbanne, France Galway</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Wikis are often considered a broad source of information. However, identifying provenance information about their content is crucial, whether it is for computing trust in public wiki pages or for identifying experts in corporate wikis. In this paper, we address this issue by providing a lightweight ontology for provenance management in wikis, based on the W7 model. Furthermore, we showcase the use of our model in a framework that computes provenance information in Wikipedia, also using DBpedia to compute provenance and contribution information per category, and not only per page.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        From public encyclopedias to corporate knowledge
management tools, wikis are often considered a broad
source of information. Yet, since wikis generally offer an open
publishing process where everyone can contribute, identifying
provenance information in their pages is an important
requirement. In particular, this information can be used to compute
trust values for pages or page fragments [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] as well as for
identifying experts based on the number of contributions [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
and other criteria such as the users’ social graphs [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] etc.
By providing this information as RDF [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], provenance
metadata becomes more transparent and offers new opportunities
for the previous use cases. It also lets people link to
provenance information from other sources and personalize
trust metrics based on the trust they have in a person regarding
a particular topic [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        This paper describes three of our contributions to address
this issue and make provenance information in
MediaWiki-powered wikis<sup>1</sup> available on the Semantic Web:
1) a lightweight ontology to represent provenance
information in wikis, based on the W7 theory [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and using
SIOC and its extensions;
2) a software architecture to extract and model provenance
information about Wikipedia pages and categories, using
the aforementioned ontology;
3) a user-interface to make this information openly available
on the Web, both to human and software agents and
directly within Wikipedia pages.
      </p>
      <p>This work is funded by the Science Foundation Ireland under grant number
SFI/08/CE/I1380 (Líon 2) and by an IRCSET scholarship.</p>
      <p><sup>1</sup> MediaWiki is the wiki engine that powers Wikipedia: www.mediawiki.org</p>
      <p>In the next section, we discuss related work in the
realm of provenance management on the Semantic Web. Then,
we give some background information regarding SIOC and
the various extensions used in our work. In Section IV, we
present the W7 theory and the lightweight ontology we have
built to represent it in RDFS. We then describe our software
architecture and how we compute provenance information in
Wikipedia, and finally present the user interface to access this
information, before concluding the paper.</p>
    </sec>
    <sec id="sec-1-related">
      <title>II. RELATED WORK</title>
      <p>
        The representation and extraction of provenance
information is not a recent research topic. Many studies have been
conducted for representing provenance of data [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], but few of
them have been focused on integrating provenance information
into the Web of data [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Providing this information as RDF
would make provenance metadata more transparent and
interlinked with other sources, and it would also enable new
scenarios for evaluating trust and data quality on top of it. In this
regard, a W3C Provenance Incubator Group<sup>2</sup> has recently been
established. The mission of the group is to “provide a
state-of-the art understanding and develop a roadmap in the area of
provenance for Semantic Web technologies, development, and
possible standardization”. Requirements for provenance on the
Web<sup>3</sup>, as well as several use cases and technical requirements,
have been provided by the working group. A comprehensive
analysis of approaches and methodologies for publishing and
consuming provenance metadata on the Web is presented in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Another research topic relevant to our work is the evaluation
of trust and data quality in wikis. Recent studies proposed
several different algorithms for wikis that would automatically
calculate users’ contributions and evaluate their quantity and
quality in order to study the authors’ behavior, produce trust
measures of the articles and find experts. WikiTrust [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is a
project aimed at measuring the quality of author contributions
on Wikipedia. Its authors developed a tool that computes the origin
and author of every word on a wiki page, as well as “a
measure of text trust that indicates the extent with which text
has been revised”<sup>4</sup>. On the same topic, other researchers tried
to solve the problem of evaluating articles’ quality, not only
by quantitatively examining the users’ history [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], but also by using
social network analysis techniques [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p><sup>2</sup> Established in September 2009. http://www.w3.org/2005/Incubator/prov/</p>
      <p><sup>3</sup> http://www.w3.org/2005/Incubator/prov/wiki/User_Requirements</p>
      <p><sup>4</sup> WikiTrust: http://wikitrust.soe.ucsc.edu/</p>
      <p>From our perspective, there is a need to publish
provenance information as Linked Data from websites hosting a
broad source of information (such as Wikipedia). Yet most
of the work on data provenance is either not focused on
integrating the generated information into the Web of data,
or mainly concerned with provenance for resource descriptions or
already structured data. On the other hand, the interesting work
done so far on analyzing trust and quality in wikis does not
take into account the importance of making the extracted
information available on the Web of data.</p>
    </sec>
    <sec id="sec-2">
      <title>III. BACKGROUND</title>
      <sec id="sec-2-1">
        <title>A. Using SIOC for wiki modelling</title>
        <p>
          The SIOC Ontology (Semantically-Interlinked Online
Communities) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] provides a model for representing online
communities and their contributions<sup>5</sup>. It is mainly centered
around the concepts of users, items and containers, so it can be
used to model content created by a particular user on several
platforms, enabling a distributed perspective on the
management of User-Generated Content on the Web. In particular, the
atomic elements of the Web applications described by SIOC
are called Items. They are grouped in Containers, which
can themselves be contained in other Containers. Finally,
every Container belongs to a Space. As an example,
a Site (subclass of Space) may contain a number of
Wikis (subclass of Container) and every Wiki contains
a set of WikiArticles (subclass of Item) generated by
UserAccounts. For more details about SIOC, we invite the
reader to consult the W3C Member Submission [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and its
online specification<sup>6</sup>.
        </p>
        <p>
          While the SIOC Types module provides several
subclasses of Container and Item, including Wiki and
WikiArticle, some characteristics of wikis required further
modelling. Hence, in our previous work [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] we extended the
SIOC Ontology to take into account such characteristics (e.g.
multi-authoring, versioning, etc.). Then, some tools to generate
and consume data from wikis using our model have also been
developed [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>B. The SIOC Actions module</title>
        <p>
          While SIOC represents the state of a community at a
given time, SIOC-actions [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] can be used to represent their
dynamics, i.e. how they evolve. Hence, SIOC provides a
document-centric view of online communities and
SIOC-actions focuses on an action-centric view. More precisely,
the evolution of an online community is represented as a set
of actions, performed by a user (sioc:UserAccount), at
some time, and impacting a number of objects (sioc:Item).
SIOC-actions provides an extensible hierarchy of properties
for representing the effect of an action on its items, such
as creates, modifies, uses, etc. Besides the SIOC
ontology, SIOC-actions relies on the vocabulary for Linking
Open Descriptions of Events (LODE)<sup>7</sup>. The core of the module
is the Action class (subclass of event:Event from the
Event Ontology), which is a timestamped event involving an
agent (e.g. a UserAccount) and a number of digital artifacts
(e.g. Items). For more details about SIOC Actions and its
implementation, see Section IV.
        </p>
        <p><sup>5</sup> http://sioc-project.org</p>
        <p><sup>6</sup> http://rdfs.org/sioc/spec/</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>IV. REPRESENTING THE W7 MODEL USING RDFS/OWL</title>
      <p>
The W7 model is an ontological model created to describe
the semantics of data provenance [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. It is a conceptual model
and, to the best of our knowledge, an RDFS/OWL representation
of this model has not been implemented yet. Hence we
focus on an implementation of this model for the specific
context of wikis. As a comparison, in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] the authors use
the example of Wikipedia to illustrate theoretically how their
proposed W7 model can capture domain or application specific
provenance.
      </p>
      <p>
        The W7 model is based on Bunge’s ontology [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and
is built on the concept of tracking the history of
the events affecting the status of things during their life cycle.
In this particular case we consider the data life cycle.
Bunge’s ontology, developed in 1977, is considered one of
the main sources of constructs to model real systems and
information systems. Since Bunge’s work is theoretical,
there has been some effort from the scientific community to
translate it into machine-readable ontologies<sup>8</sup>.
      </p>
      <p>
        The W7 model represents data provenance using seven
fundamental elements or interrogative words: what, when,
where, how, who, which, and why. It has been purposely built
with general and extensible principles, hence it is possible to
capture provenance semantics for data in different domains.
We refer to [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] for a detailed description of the mappings
between W7 and Bunge’s models, and in Table I we provide
a summary of the W7 elements (as in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]).
      </p>
      <p>Given the structure of the W7 model, our
motivation for choosing the SIOC Actions module as the core of
our model is clear: most of the concepts in the Actions module are the
same as in the W7 model. Furthermore, wikis are community
sites and the Actions module has been implemented to
represent dynamic, action-centric views of online communities.</p>
      <p>In the following sections we give a detailed description of
how we answered each of these seven questions.</p>
      <sec id="sec-3-1">
        <title>A. What</title>
        <p>The What element represents an event that affected data
during its life cycle. It is a change of state and the core of
the model. In this regard, there are three main events affecting
data: creation, modification and deletion. In the context of
wikis, each of them can appear: users can (1) add new
sentences (or characters), (2) remove sequences of characters,
or (3) modify characters by removing and then adding content
in the same position of the article. In addition, in systems like
Wikipedia, some other specific events can affect the data on the
wiki, for example “quality assessment” or “change in access
rights” of an article [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]; however, they can be expressed with
the three broader types defined above.
        </p>
        <table-wrap id="tab-1">
          <label>Table I</label>
          <caption><p>Summary of the W7 elements</p></caption>
          <table>
            <thead>
              <tr><th>Element</th><th>Description</th></tr>
            </thead>
            <tbody>
              <tr><td>What</td><td>An event (i.e. change of state) that happens to data during its life time</td></tr>
              <tr><td>How</td><td>An action leading to the events. An event may occur when it is acted upon by another thing, which is often a human or a software agent</td></tr>
              <tr><td>When</td><td>Time or, more accurately, the duration of an event</td></tr>
              <tr><td>Where</td><td>Locations associated with an event</td></tr>
              <tr><td>Who</td><td>Agents, including persons or organizations, involved in an event</td></tr>
              <tr><td>Which</td><td>Instruments or software programs used in the event</td></tr>
              <tr><td>Why</td><td>Reasons that explain why an event occurred</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p><sup>7</sup> LODE Ontology specification: http://linkedevents.org/ontology/</p>
        <p><sup>8</sup> Evermann J. provides an OWL description of Bunge’s ontology at:
http://homepages.mcs.vuw.ac.nz/ jevermann/Bunge/v5/index.html</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <p>Since (1) wikis commonly provide a versioning mechanism
for their content and (2) every action on a wiki article leads
to the generation of a new article revision, the core event
describing our What element is the creation of an article
version. In particular we model this creation, and the related
modification of the latest version (i.e. the permalink), using
the SIOC-Actions model as shown in Listing 1.</p>
      <preformat>&lt;http://example.com/action?title=Dublin_Core#380106133&gt;
    sioca:creates &lt;http://en.wikipedia.org/w/index.php?title=Dublin_Core&amp;oldid=380106133&gt; ;
    sioca:modifies &lt;http://en.wikipedia.org/wiki/Dublin_Core&gt; ;
    a sioca:Action .</preformat>
      <p>Listing 1. Representing the “What” element</p>
      <p>As the example above, expressed
in Turtle syntax, shows, we have a sioca:Action identified
by the URI &lt;http://example.com/action?title=Dublin_Core#380106133&gt;
that leads to the creation of a revision of the main
wiki article about “Dublin Core”. The creation of the new
revision originated from a modification (sioca:modifies)
of the main Wikipedia article &lt;http://en.wikipedia.org/wiki/Dublin_Core&gt;.
Details about the type of event are given
in the next section about the How element, where we identify
the type of action involved in the event creation.</p>
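      <p>As an illustration, the following minimal Python sketch generates the Turtle description of the What element for one revision. The function name, its parameters and the default action base URI are assumptions made for this example; it also assumes that the page title is already URL-safe (underscores instead of spaces) and that the sioca prefix is declared elsewhere in the output document.</p>
      <preformat><![CDATA[
```python
def what_element_turtle(title: str, revision_id: int,
                        action_base: str = "http://example.com/action") -> str:
    """Return Turtle describing the action that created one article revision.

    Hypothetical helper: mirrors the URI patterns used in Listing 1.
    """
    # URI identifying the action itself
    action = f"{action_base}?title={title}#{revision_id}"
    # Permanent URI of the revision created by the action
    revision = f"http://en.wikipedia.org/w/index.php?title={title}&oldid={revision_id}"
    # URI of the main article, which the action modifies
    article = f"http://en.wikipedia.org/wiki/{title}"
    return (
        f"<{action}>\n"
        f"    sioca:creates <{revision}> ;\n"
        f"    sioca:modifies <{article}> ;\n"
        f"    a sioca:Action .\n"
    )


if __name__ == "__main__":
    print(what_element_turtle("Dublin_Core", 380106133))
```
]]></preformat>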
      <p>B. How</p>
      <p>
        The How element in W7 is an equivalent to the Action
element from Bunge’s ontology, and describes the action
leading to an event. In wikis, the possible actions leading
to an event (i.e. the creation of a new revision) are all
the edits applied to a specific article revision. By analyzing
the diff between two subsequent revisions of a page, we
can identify the type of action involved in the creation of
the newer revision. In particular we focus on modelling the
following types of edits: Insertion, Update and Deletion of
both Sentences and References. With the term Sentence here
we refer to every sequence of characters that does not include
a reference or a link to another source, and with Reference
we refer to every action that involves a link or a so-called
Wikipedia reference. As discussed in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], another type of
edit would be a Revert, i.e. an undo of the effects of one or
more previous edits. However, in Wikipedia, a
revert does not restore a previous version of the article, but
creates a new version with content similar to that of an
earlier selected version. In this regard, we decided to model a
revert like all the other edits, and not as a particular pattern. The
distinction between a revert and other types of action can
still be identified, with an acceptable level of precision, by looking
at the user comment entered when doing the revert, since most
users add a related revert comment<sup>9</sup>.
      </p>
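      <p>The two revert heuristics mentioned here (inspecting the edit comment, and comparing a revision with the one two steps earlier) could be sketched in Python as follows. The keyword list and both function names are assumptions made for this example, not part of the model; a real deployment would need a list tuned to the comments actually used on the wiki.</p>
      <preformat><![CDATA[
```python
import re

# Assumed keywords that typically appear in revert comments (illustrative only)
REVERT_PATTERN = re.compile(r"\b(revert(ed|ing)?|rv|undid|undo)\b", re.IGNORECASE)


def looks_like_revert(comment: str) -> bool:
    """Comment-based heuristic: True if the edit comment mentions a revert."""
    return bool(REVERT_PATTERN.search(comment))


def is_revert_by_content(texts: list[str], index: int) -> bool:
    """Content-based heuristic: True if revision `index` has exactly the
    text of the revision two steps earlier, i.e. it undoes the edit between."""
    return index >= 2 and texts[index] == texts[index - 2]
```
]]></preformat>
      <p>The comment-based check is cheap but depends on users writing a recognisable comment, while the content-based check only detects exact restorations of the previous state.</p>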
      <p>Going further, and to represent provenance data for the
action involved in each wiki edit, we modelled the diffs
appearing between pages. To model the differences calculated
between subsequent revisions we created a lightweight
Diff ontology, inspired by the Changeset vocabulary<sup>10</sup>.
Yet, instead of describing changes to RDF statements, our
model aims at describing changes to plain text documents.
It provides a main class, the diff:Diff class, and six
subclasses: SentenceUpdate, SentenceInsertion,
SentenceDeletion and ReferenceUpdate,
ReferenceInsertion, ReferenceDeletion, based
on the previous How patterns.</p>
      <p>The main Diff class represents all information about
the change between two versions of a wiki page (see
Fig. 1). The Diff’s properties subjectOfChange and
objectOfChange point respectively to the version changed
by this diff and to the newly created version. Details about
the time and the creator of the change are provided
respectively by dc:created and sioc:has_creator.
Moreover, the comment about the change is provided by the
diff:comment property with range rdfs:Literal. In
Figure 1 we also display a Diff class linking to another Diff
class. The latter represents one of the six Diff subclasses
described earlier in this section. Since a single diff between
two versions can be composed of several atomic changes (or
“sub-diffs”), a Diff class can then point to several subclasses
using the dc:hasPart property. Each Diff subclass can
have at most one TextBlock removed and one added: if
it has both, then the type of change is an Update; otherwise
the type is an Insertion or a Deletion.</p>
      <p><sup>9</sup> Note that we could also compare the n-1 and n+1 versions of each page to
identify whether a change is a revert.</p>
      <p><sup>10</sup> The Changeset schema: http://purl.org/vocab/changeset/schema#</p>
      <p>The TextBlock class is part of the Diff ontology and
represents a sequence of characters added or removed at a
specific position of a plain text document. It exposes the
content of this sequence of characters (content) and
a pointer to its position inside the document (lineNumber).
Note that document content is usually
organized in lines, as in wiki articles, but this class
is generic enough to be reused with other types of text
organization. Note also that each of the six subclasses of
the Diff class inherits the properties defined for the parent
class; this is not displayed in Figure 1 for
space reasons.</p>
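      <p>The rule that maps removed and added text blocks to one of the six Diff subclasses can be illustrated with a short Python sketch. The TextBlock dataclass mirrors the content and lineNumber properties described above; the regular expression used to recognise a reference (wiki link, URL or ref tag) is an assumption of this example, since the exact detection rule is not specified here.</p>
      <preformat><![CDATA[
```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextBlock:
    content: str      # the characters added or removed
    line_number: int  # position of the block inside the document


# Assumption: a block is a "Reference" if it contains a wiki link,
# an external URL or a <ref> tag; everything else is a "Sentence".
REFERENCE_PATTERN = re.compile(r"\[\[|https?://|<ref", re.IGNORECASE)


def classify_change(removed: Optional[TextBlock],
                    added: Optional[TextBlock]) -> str:
    """Return the name of the Diff subclass for one atomic change."""
    if removed is None and added is None:
        raise ValueError("a change needs at least one text block")
    blocks = [b for b in (removed, added) if b is not None]
    is_reference = any(REFERENCE_PATTERN.search(b.content) for b in blocks)
    target = "Reference" if is_reference else "Sentence"
    if removed is not None and added is not None:
        kind = "Update"      # one block removed and one added
    elif added is not None:
        kind = "Insertion"   # only an added block
    else:
        kind = "Deletion"    # only a removed block
    return target + kind
```
]]></preformat>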
      <p>With the model presented it is possible to address an
important requirement for provenance: the reproducibility of
a process. Starting from an older revision of a wiki article,
just following the diffs between the newer revisions and the
TextBlocks added or removed, it is possible to reconstruct
the latest version of the article. This approach goes a step
further than just storing the different data versions: it provides
details of the entire process involved in the data life cycle.</p>
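      <p>A minimal Python sketch of this reconstruction is given below. It rests on assumptions that go beyond the text: line numbers are taken to be 1-based, atomic changes are applied in order, and each line number refers to the document state at the moment the change is applied. The Change class and both function names are hypothetical.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextBlock:
    content: str
    line_number: int  # assumed to be 1-based


@dataclass
class Change:
    """One atomic change (sub-diff): at most one block removed, one added."""
    removed: Optional[TextBlock] = None
    added: Optional[TextBlock] = None


def apply_change(lines: list[str], change: Change) -> list[str]:
    """Apply one atomic change to a document given as a list of lines."""
    result = list(lines)
    if change.removed is not None:
        index = change.removed.line_number - 1
        # Sanity check: the block to remove must match the document
        if result[index] != change.removed.content:
            raise ValueError("removed block does not match the document")
        del result[index]
    if change.added is not None:
        result.insert(change.added.line_number - 1, change.added.content)
    return result


def replay(lines: list[str], diffs: list[list[Change]]) -> list[str]:
    """Rebuild the latest revision from an older one by following the diffs."""
    for diff in diffs:
        for change in diff:
            lines = apply_change(lines, change)
    return lines
```
]]></preformat>
      <p>Each inner list stands for one Diff (one revision step) and its atomic changes; replaying all of them from an old revision yields the text of the latest one.</p>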
      <sec id="sec-4-1">
        <title>C. When</title>
        <p>The When element in W7 is equivalent to the Time element
from Bunge’s ontology and refers to the time at which an
event occurs, which every wiki platform records for page
edits. As depicted in Figure 1, each Diff class is linked to the
timestamp of the event using the dc:created property. The
same timestamp is also linked to each Diff subclass using
the same property (not shown in Fig. 1 for space reasons). The
time of the event is modelled in more detail in the Action
element, as shown in Listing 2<sup>11</sup>.</p>
        <preformat>&lt;http://example.com/action?title=Dublin_Core#380106133&gt;
    dc:created "2010-08-21T06:36:17Z"^^&lt;http://www.w3.org/2001/XMLSchema#dateTime&gt; ;
    lode:atTime [
        a time:Instant ;
        time:inXSDDateTime "2010-08-21T06:36:17Z"^^&lt;http://www.w3.org/2001/XMLSchema#dateTime&gt;
    ] ;
    a sioca:Action .</preformat>
        <p>Listing 2. Representing the “When” element in Turtle syntax</p>
        <p>
          In this context we consider actions to be instantaneous. As in
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], we track the instant at which an action takes effect on a wiki
(i.e. when a wiki page is saved). Usually, this creation time
is represented using dc:created. Another option, provided
by the LODE ontology, uses the lode:atTime property to
link to a class representing a time interval or an instant.
        </p>
        <p><sup>11</sup> For all the namespaces see: http://prefix.cc</p>
      </sec>
      <sec id="sec-4-2">
        <title>D. Where</title>
        <p>The Where element represents the online “Space” or the
location associated with an event. In wikis, and in particular
in Wikipedia, this is one of the most controversial elements
of the W7 model. If the location of an article update is
taken to be the location of the user when updating the
content, then this information is neither complete nor
always accurate on Wikipedia. Indeed, we can extract it
only from the IP addresses of anonymous users, not
for all Wikipedia users. Note that it is possible to
link a sioc:UserAccount (e.g. &lt;http://en.wikipedia.org/wiki/User:96.245.230.136&gt;)
to the related IP address using the
SIOC ip_address property.</p>
        <p>E. Who</p>
        <p>The Who element describes an agent involved in an event;
it therefore includes a person or an organization. On a wiki it
represents the editor of a page, who can be either a registered
user or an anonymous user. A registered user might also
have different roles on the Wikipedia site and, on this basis,
different permissions are granted to their account. In this work
we are only interested in keeping track of the user account
involved in each event, not of its role on the wiki.
Users are modelled with the sioc:UserAccount class and
linked to each sioca:Action, sioct:WikiArticle
and diff:Diff with the property sioc:has_creator. A
sioc:UserAccount represents a user account, in an online
community site, owned by a physical person, a group or an
organization (i.e. a foaf:Agent). Hence a physical person,
represented by a foaf:Person (a subclass of foaf:Agent),
can be linked to several sioc:UserAccount instances.</p>
        <p>F. Which</p>
        <p>The Which element represents the programs or
instruments used in the event. In our particular case it is the software
used to perform the edit, which might be a bot or the wiki
software used by the editor. Since there is no direct and
precise way to identify whether an edit has been made by a
human or a bot, our model does not make this distinction. A
naive method would be to check whether the username
contains the string “bot”.</p>
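        <p>The naive username check can be written in a few lines of Python. This is only a sketch of the heuristic; the function name and the usernames in the usage note are made up for the example.</p>
        <preformat><![CDATA[
```python
def looks_like_bot(username: str) -> bool:
    """Naive heuristic: True if the username contains "bot" (case-insensitive)."""
    return "bot" in username.lower()
```
]]></preformat>
        <p>A hypothetical account named “ExampleBot” is flagged as expected, but so is a human editor called “Abbott”, which shows why this check is too imprecise to be encoded in the model.</p>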
        <p>G. Why</p>
        <p>The Why element represents the reasons behind the
occurrence of an event. On Wikipedia it is given by the justification for
a change that a user enters in the “comment” field. This
field is not mandatory when editing a wiki page,
but the Wikipedia guidelines recommend filling it in.
We model the comment left by the user with the property
diff:comment, linking the diff:Diff class to the related
rdfs:Literal.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>V. APPLICATION USING PROVENANCE DATA FROM WIKIPEDIA</title>
      <sec id="sec-5-1">
        <title>A. Collecting the data from the Web</title>
        <p>In order to validate and test our modelling solution for
provenance on wikis, and in particular on the Wikipedia
website, we collected data from the English Wikipedia and the
DBpedia service. Since the DBpedia project<sup>12</sup> extracts and
publishes structured information from the English Wikipedia,
it is considered its RDF export. Collecting data not only
from Wikipedia but also from DBpedia has an
important advantage: it directly provides us with structured data
modelled in RDF with popular, standard lightweight ontologies.
We use the DBpedia data especially for the categories that
hierarchically structure the articles on Wikipedia. We ran our
experiment on a portion of the Wikipedia articles,
namely the articles belonging to the whole hierarchy
under a given category. By doing this we could limit our
dataset to articles strongly related to each other, and
collect a user community with a common interest.</p>
        <p>A PHP script has been developed to extract all the articles
belonging to a category and all its subcategories and, for each
article, its whole revision history. In more detail, this program:</p>
        <list list-type="bullet">
          <list-item><p>Executes a SPARQL<sup>13</sup> query over the DBpedia endpoint
to get the category hierarchy;</p></list-item>
          <list-item><p>Stores the category hierarchy (modelled with the
SKOS<sup>14</sup> vocabulary) in a local triplestore;</p></list-item>
          <list-item><p>Queries the DBpedia endpoint again to get all the articles
belonging to the categories collected;</p></list-item>
          <list-item><p>For all the articles collected, generates (and stores
locally) RDF data using the SIOC-MediaWiki exporter<sup>15</sup>;</p></list-item>
          <list-item><p>Using the sioc:previous_version property,
exports RDF for all the previous revisions of each article.</p></list-item>
        </list>
        <p>The advantage of using DBpedia in this process is clear:
we collected structured data by executing just two lightweight
SPARQL queries.</p>
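        <p>To give an idea of what two such queries might look like, the Python sketch below builds them as strings. Everything in it is an assumption made for illustration: the property path skos:broader* (a SPARQL 1.1 feature), the use of dcterms:subject to link articles to categories, the VALUES clause and the function names are not taken from the script described here, which is written in PHP.</p>
        <preformat><![CDATA[
```python
PREFIXES = (
    "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
    "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
)


def subcategories_query(root_category_uri: str) -> str:
    """Query 1 (assumed shape): the root category, all categories below it,
    and their direct parents."""
    return PREFIXES + (
        "SELECT DISTINCT ?category ?parent WHERE {\n"
        f"  ?category skos:broader* <{root_category_uri}> .\n"
        "  OPTIONAL { ?category skos:broader ?parent }\n"
        "}\n"
    )


def articles_query(category_uris: list[str]) -> str:
    """Query 2 (assumed shape): all articles filed under the given categories."""
    values = " ".join(f"<{uri}>" for uri in category_uris)
    return PREFIXES + (
        "SELECT DISTINCT ?article ?category WHERE {\n"
        f"  VALUES ?category {{ {values} }}\n"
        "  ?article dcterms:subject ?category .\n"
        "}\n"
    )


if __name__ == "__main__":
    print(subcategories_query("http://dbpedia.org/resource/Category:Semantic_Web"))
```
]]></preformat>
        <p>The resulting strings could be sent to any SPARQL endpoint; the sketch deliberately stops before the network call so that it stays self-contained.</p>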
        <p>A second PHP script has been developed to extract detailed
provenance information from the articles collected in the
previous step. This script computes the diff between
consecutive versions of the articles and retrieves further related
information from the Wikipedia API. The data retrieved from
the API comprises all the information needed to
create the model described in the previous section: the
editor, the timestamp, the comment and
the ID of each version. Moreover, the algorithm
not only extracts the diff but also
computes the type of change for each of the differences
identified. This allows us to mark each change with one of the
Sentence or Reference Insertion/Update/Deletion subclasses
of the diff:Diff class. Finally, the script generates RDF
data following the model described before and inserts it into the
local triplestore.</p>
        <p>In order to test our application, we ran the
data extraction algorithm starting from the category “Semantic
Web” on the English Wikipedia, and we generated data for
all the 166 wiki articles belonging to this category and its
subcategories recursively. As we can see, using Semantic Web
technologies gives us the advantage of a single,
standard language to query wiki and provenance data together,
while developers who need to query the original systems have to
learn a new API for each new system they want to query.</p>
        <p><sup>12</sup> http://dbpedia.org</p>
        <p><sup>13</sup> Query Language for RDF: http://www.w3.org/TR/rdf-sparql-query/</p>
        <p><sup>14</sup> SKOS Reference: http://www.w3.org/TR/skos-reference/</p>
        <p><sup>15</sup> http://ws.sioc-project.org/mediawiki/</p>
      </sec>
      <sec id="sec-5-3">
        <title>B. A Firefox plug-in for provenance from Wikipedia</title>
        <p>In order to show the potential of the data collected and
the data model created, we built an application that shows some
interesting statistics extracted from the provenance information of
the analyzed articles. The application displays a table directly
at the top of each Wikipedia article, exposing
information about the most active users on the article and their edits.
In particular, it has been developed as a Greasemonkey<sup>16</sup>
script; Greasemonkey is a Mozilla Firefox extension that allows users to install
scripts that make on-the-fly changes to HTML web page
content. The script is written in JavaScript and
is now compatible with other popular Web browsers. The
application is composed of the following
elements: 1) the triplestore containing the data collected and
exposing a SPARQL endpoint for querying the data; 2) a
PHP script, used as an interface between the Greasemonkey
script and the triplestore; 3) a Greasemonkey script, which
retrieves the URL of the loaded Wikipedia page, sends the
request to the PHP script and then displays the returned
HTML data on the Wikipedia page. The PHP script is important
in this application because it is responsible for executing
the SPARQL queries on the triplestore. Furthermore, it retrieves
the results and creates the HTML code to embed in the
Wikipedia page. A screenshot of the result is
displayed in Figure 3.</p>
        <p>The tables displayed in Figure 3 appear only at the top of
the Wikipedia articles and categories that we analyzed with the
method described in Section V-A. A different type of table is
shown when the page visited is a category page. In the top table
of Figure 3, we can see the six users who made the
largest number of edits on the article. For each of these users
we then compute: (1) their total number of edits on the page;
(2) their percentage of “ownership” of the page (or better, the
percentage of their edits compared to all the edits made on the
article); (3) the number of lines they added to the article; (4) the
number of lines they removed from the article; (5) the total number
of lines they added and removed on all the articles belonging to
the category “Semantic Web”. In the other use case, when
the user visits a Wikipedia category page, we display different
types of information using the same method (see the bottom table
in Figure 3). When browsing a wiki category page, the
application shows a list of the users with the largest number
of edits on the articles of the whole category (and related
subcategories). It also shows the percentages of their
edits compared to the total edits in the category. The second
table, on the right, lists the most edited articles in
the category during the last three months. Note also that
at the bottom of each table there is a link pointing to a page
where a longer list of results is displayed.</p>
        <p><sup>16</sup> http://www.greasespot.net/</p>
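        <p>The edit counts and “ownership” percentages shown in the first table can be illustrated with a small Python sketch. It assumes that the author of every edit on a page is already available as a list of usernames; in the application itself this information is obtained through SPARQL queries on the triplestore, so the function below is only an illustration with a hypothetical name.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def top_editors(edit_authors: list[str],
                limit: int = 6) -> list[tuple[str, int, float]]:
    """Return (user, number of edits, percentage of all edits) for the
    `limit` most active users, most active first."""
    counts = Counter(edit_authors)
    total = len(edit_authors)
    return [
        (user, edits, round(100.0 * edits / total, 1))
        for user, edits in counts.most_common(limit)
    ]
```
]]></preformat>
        <p>The default limit of six matches the number of users listed in the article table; the same function applied to the edits of a whole category gives the figures of the category table.</p>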
        <p>
          At the moment, the PHP script we developed is available at
http://vmuss06.deri.ie/WikiProvenance/index.php. Using this
script alone, it is possible to obtain the same information that is displayed
by the Greasemonkey script, as well as the RDF
descriptions of the requested page. In order to represent this
statistical information in RDF, we use SCOVO, the Statistical
Core Vocabulary [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. It relies on the concepts of items and
dimensions to represent statistical information. In our context,
an item is one piece of statistical information (e.g. user
“X” edited 10 lines on page “Y”), and various dimensions are
involved in its description: (1) the type of information that
we want to represent (number of edits, percentage, lines added
and removed, etc.); (2) the page or the category concerned;
(3) the user involved. Hence, we created four instances of
scv:Dimension to represent the first dimension, and
simply relied on the scv:dimension property for the other
ones. As an example, the following snippet states that the
user KingsleyIdehen made 11 edits on the SIOC page.
        </p>
        <p>ex:123 a scv:Item ;
  rdf:value 11 ;
  scv:dimension :Edits ;
  scv:dimension &lt;http://wikipedia.org/wiki/SIOC&gt; ;
  scv:dimension &lt;http://wikipedia.org/wiki/User:KingsleyIdehen&gt; .</p>
        <p>Listing 3. Representing the number of edits by a user with SCOVO</p>
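        <p>As a rough illustration of how such items could be produced from computed statistics, the following Python sketch serialises one statistic as Turtle using plain string formatting. It is not the code of our PHP script: the function names, the ex: namespace and the default namespace are hypothetical placeholders, and the SCOVO namespace IRI is given as we understand it and should be checked against the vocabulary documentation.</p>
        <preformat><![CDATA[
```python
# Prefixes emitted ahead of the items. The rdf IRI is the standard one; the
# scv IRI is our understanding of the SCOVO namespace (to be verified); the
# ex: and default prefixes are placeholders chosen for this sketch.
PREFIXES = {
    "scv": "http://purl.org/NET/scovo#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ex": "http://example.org/stats/",
    "": "http://example.org/dimensions#",
}


def turtle_prefixes(prefixes=PREFIXES):
    """Render the @prefix header of a Turtle document."""
    return "\n".join(
        f"@prefix {name}: <{iri}> ." for name, iri in prefixes.items()
    )


def scovo_item(item_id, value, measure, page_iri, user_iri):
    """Render one statistical item with its three dimensions.

    `measure` names the kind of statistic (e.g. "Edits"), `page_iri` the
    page or category concerned, and `user_iri` the user involved.
    """
    return "\n".join([
        f"ex:{item_id} a scv:Item ;",
        f"    rdf:value {value} ;",
        f"    scv:dimension :{measure} ;",
        f"    scv:dimension <{page_iri}> ;",
        f"    scv:dimension <{user_iri}> .",
    ])
```
]]></preformat>
        <p>Calling scovo_item("123", 11, "Edits", "http://wikipedia.org/wiki/SIOC", "http://wikipedia.org/wiki/User:KingsleyIdehen") yields the five triples of the example above. String formatting keeps the sketch free of dependencies; an RDF library would be preferable when values need escaping or validation.</p>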
      </sec>
    </sec>
    <sec id="sec-6">
      <title>VI. CONCLUSION AND FUTURE WORK</title>
      <p>The goal of this paper was to provide a solution for
representing and managing the provenance of data from Wikipedia
(and other wikis) using Semantic Web technologies. To this end,
we provided: a lightweight ontology for
provenance in wikis, based on the W7 model; a framework
for extracting provenance data from Wikipedia; and an
application for accessing the generated data in a meaningful
way and exposing it to the Web of Data. We showed that
the W7 model is a good choice for modelling provenance
information, both in general and in wikis, but that, because of its high
level of abstraction, it has to be refined, for instance with other
more specific lightweight ontologies. In our case this was done
using SIOC and the Actions module. Future developments will
include a refinement of the proposed model and its subsequent
alignment with other general-purpose ontologies for
representing provenance as Linked Data (e.g. the Open Provenance
Model). We also plan to extend the capabilities
of our application by offering more features and by providing a
wider range of data, with an architecture that automatically
updates the data as soon as it changes on Wikipedia.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>SIOC Core Ontology Specification</article-title>
          .
          <source>W3C Member Submission 12 June 2007</source>
          , World Wide Web Consortium,
          <year>2007</year>
          . http://www.w3.org/Submission/sioc-spec/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.T.</given-names>
            <surname>Adler</surname>
          </string-name>
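          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>de Alfaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Pye</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Vishwanath</given-names>
            <surname>Raman</surname>
          </string-name>
          .
          <article-title>Measuring author contributions to the Wikipedia</article-title>
          .
          <source>In Proceedings of WikiSym '08</source>
          . ACM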
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Mario</given-names>
            <surname>Bunge</surname>
          </string-name>
          .
          <source>Treatise on Basic Philosophy: Ontology I: The Furniture of the World</source>
          . Reidel
          , Boston,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.A.</given-names>
            <surname>Champin</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Passant</surname>
          </string-name>
          .
          <article-title>SIOC in Action - Representing the Dynamics of Online Communities</article-title>
          .
          <source>In Proceedings of the 6th International Conference on Semantic Systems (I-SEMANTICS 2010)</source>
          . ACM,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Golbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parsia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Hendler</surname>
          </string-name>
          .
          <article-title>Trust networks on the semantic web</article-title>
          .
          <source>Cooperative Information Agents VII</source>
          , pages
          <fpage>238</fpage>
          -
          <lpage>249</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Olaf</given-names>
            <surname>Hartig</surname>
          </string-name>
          .
          <article-title>Provenance information in the web of data</article-title>
          .
          <source>In 2nd Workshop on Linked Data on the Web (LDOW 2009) at WWW</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Olaf</given-names>
            <surname>Hartig</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jun</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>Publishing and Consuming Provenance Metadata on the Web of Linked Data</article-title>
          .
          <source>In Proceedings of 3rd Int. Provenance and Annotation Workshop</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M</given-names>
            <surname>Hausenblas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W</given-names>
            <surname>Halb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y</given-names>
            <surname>Raimond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L</given-names>
            <surname>Feigenbaum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D</given-names>
            <surname>Ayers</surname>
          </string-name>
          .
          <article-title>SCOVO: Using statistics on the Web of data</article-title>
          .
          <source>In Semantic Web in Use Track of the 6th European Semantic Web Conference (ESWC2009)</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B</given-names>
            <surname>Hoisl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W</given-names>
            <surname>Aigner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S</given-names>
            <surname>Miksch</surname>
          </string-name>
          .
          <article-title>Social Rewarding in Wiki Systems - Motivating the Community</article-title>
          .
          <source>In Proceedings of the 2nd international conference on Online communities and social computing</source>
          , pages
          <fpage>362</fpage>
          -
          <lpage>371</lpage>
          . Springer-Verlag,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>NT</given-names>
            <surname>Korfiatis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M</given-names>
            <surname>Poulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G</given-names>
            <surname>Bokos</surname>
          </string-name>
          .
          <article-title>Evaluating authoritative sources using social networks: an insight from Wikipedia</article-title>
          .
          <source>Online Information Review</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Fabrizio</given-names>
            <surname>Orlandi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alexandre</given-names>
            <surname>Passant</surname>
          </string-name>
          .
          <article-title>Enabling cross-wikis integration by extending the SIOC ontology</article-title>
          .
          <source>In 4th Semantic Wiki Workshop (SemWiki 2009)</source>
          . CEUR-WS
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Fabrizio</given-names>
            <surname>Orlandi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alexandre</given-names>
            <surname>Passant</surname>
          </string-name>
          .
          <article-title>Semantic Search on Heterogeneous Wiki Systems</article-title>
          .
          <source>In International Symposium on Wikis (WikiSym2010)</source>
          . ACM,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Sudha</given-names>
            <surname>Ram</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jun</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Understanding the semantics of data provenance to support active conceptual modeling</article-title>
          , pages
          <fpage>17</fpage>
          -
          <lpage>29</lpage>
          . Springer Berlin / Heidelberg, lncs edition,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Sudha</given-names>
            <surname>Ram</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jun</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>A New Perspective on Semantics of Data Provenance</article-title>
          .
          <source>In First International Workshop on the role of Semantic Web in Provenance Management (SWPM 2009)</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.L.</given-names>
            <surname>Simmhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Plale</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Gannon</surname>
          </string-name>
          .
          <article-title>A survey of data provenance techniques</article-title>
          . Computer Science Department, Indiana University, Bloomington IN 47405,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>