=Paper=
{{Paper
|id=Vol-356/paper-4
|storemode=property
|title=Hyperstructure Maintenance Costs in Large-scale Wikis
|pdfUrl=https://ceur-ws.org/Vol-356/paper4.pdf
|volume=Vol-356
|dblpUrl=https://dblp.org/rec/conf/www/BoulainSG08
}}
==Hyperstructure Maintenance Costs in Large-scale Wikis==
Philip Boulain, Nigel Shadbolt, Nicholas Gibbins
prb@ecs.soton.ac.uk, nrs@ecs.soton.ac.uk, nmg@ecs.soton.ac.uk
Intelligence, Agents, Multimedia Group
School of Electronics and Computer Science, University of Southampton
Southampton SO17 1BJ, United Kingdom
ABSTRACT

Wiki systems have developed over the past years as lightweight, community-editable, web-based hypertext systems. With the emergence of Semantic Wikis, these collections of interlinked documents have also gained a dual role as ad-hoc RDF [8] graphs. However, their roots lie at the limited hypertext capabilities of the World Wide Web [1]: embedded links, without support for composite objects or transclusion. In this paper, we present experimental evidence that hyperstructure changes, as opposed to content changes, form a substantial proportion of editing effort on a large-scale wiki. The experiment is set in the wider context of a study of how the technologies developed during decades of hypertext research may be applied to improve management of wiki document structure and, with semantic wikis, knowledge structure.

Categories and Subject Descriptors

H.3.5 [Information Systems]: Information Storage and Retrieval—Online Information Services; H.5.3 [Information Systems]: Information Interfaces and Presentation—Group and Organization Interfaces; H.5.4 [Information Systems]: Information Interfaces and Presentation—Hypertext/Hypermedia

General Terms

Experimentation

Keywords

Hypertext, Semantic Web, Wiki, Web Science

Copyright is held by the Authors. Copyright transferred for publishing online and a conference CD-ROM. SWKM'2008: Workshop on Social Web and Knowledge Management @ WWW 2008, April 22, 2008, Beijing, China.

1. INTRODUCTION

This experiment forms part of a broader project looking into the potentially beneficial relationships between open hypermedia, the study of interconnected documents; the Semantic Web, the study of interconnectable data; and 'wikis', web-based communal editing systems.

Hypermedia is a long-standing field of research into the ways in which documents can expand beyond the limitations of paper, generally in terms of greater cross-referencing and composition (reuse) capability. Bush's As We May Think [2] introduces the hypothetical early hypertext machine, the 'memex', and defines its "essential feature" as "the process of tying two items together". This linking between documents is the common feature of hypertext systems, upon which other improvements are built.

As well as simple binary (two-endpoint) links, hypertext systems have been developed with features including n-ary links (multiple documents linked to multiple other documents), typed links (links which indicate something about why or how documents are related), generic links (links whose endpoints are determined by matching criteria of the document content, such as particular words), and composite documents, which are formed by combining a set of other, linked, documents. Open Hypermedia extends this with interoperation, both with other hypermedia systems and users, and with non-hypermedia resources. A key concept in open hypermedia is that of the non-embedded link: links (and anchors) which are held external to the documents they connect. These allow links to be made to immutable documents, and to be added and removed in sets, often termed 'linkbases'. One of the earliest projects attempting to implement globally-distributed hypertext was Xanadu [9], a distinctive feature of whose design was transclusion: including (sections of) a document into another by reference.

In related work, we are currently investigating the relationship between an exemplar semantic wiki, Semantic MediaWiki [6], and open hypermedia systems, as defined by the Dexter Hypertext Reference Model [5]. Our preliminary results, based on a formal description of Semantic MediaWiki in terms of the Dexter model, suggest that such semantic wikis can be treated as simple open hypermedia systems. While details are beyond the scope of this paper, some basic parallels are evident: a wiki node is akin to both a hypermedia document and a semantic web resource. Semantic wikis generally treat typed inter-node links as RDF statements relating the nodes, and these links are embedded and binary in hypermedia terms. From this we can see a meaningful similarity between a graph of documents connected by typed links, and a graph of resources connected by RDF statements. We can also see that wikis do not have features covering more advanced hypermedia links, such as those which are not embedded, or which have more than two endpoints. This suggests that semantic wikis stand to gain from techniques developed within hypermedia, but we must first judge whether there is any substantial cost to be reduced. Hence we have performed a quantitative experiment on a large-scale public wiki system to measure the proportion of effort expended on hyperstructure-related activities, as opposed to editing the document content.
2. HYPOTHESIS

We carried out an experiment to estimate the proportion of effort expended maintaining the infrastructure around data, rather than the data itself, on a weak hypertext wiki system. We define a 'weak' hypertext system here as one whose feature set is limited to embedded, unidirectional, binary links, as with the World Wide Web. Our hypothesis is that the manual editing of link structure, of a type which richer hypertext features could automate, will prove to be a significant overhead compared with changes to the text content. This experiment also seeks to partially recreate a related, informal experiment, discussed in an essay by Swartz [10].

3. DATASET

We chose English Wikipedia (http://en.wikipedia.org/) as the experimental dataset, because it has both a considerably large and varied set of documents, and a complete history of the editing processes, performed by a wide range of Web users, between their first and current versions (MediaWiki, unlike many wikis, never deletes old revisions of a page). The wiki community keeps the dataset fairly well inter-linked and categorised for cross-reference, but it does this via the cumulative efforts of a large body of part-time editors. As well as being statistically significant, demonstrating possible improvement of English Wikipedia is socially significant, as it is a widely-used and active resource.

It is important to stress the size of the English Wikipedia dataset. Wikipedia makes available 'dumps' of its database in an ad-hoc XML format; because this study is interested in the progression of page contents across revisions, it was necessary to use the largest of these dumps, containing both page full-text and history (unfortunately, also non-encyclopædic pages, such as discussions and user pages). This dump is provided compressed using the highly space-efficient (although time-complex) bzip2 algorithm; even then, it is 84.6GB. The total size of the XML file is estimated to be in the region of two terabytes.

4. PROCEDURE

Figure 1 shows the simplified data flow of the processing of the dump performed for the experiment.

First, we trimmed down the dataset to just those pages which are encyclopædic articles, as these are the pages of greatest significance to the Wikipedia project's goals, and thus the most important to study. Otherwise, the dataset would include a lot of 'noise' in the form of discussion and user pages, which are likely to have different editing patterns, and to be less connected to the hyperstructure. The most practical way to do this was to remove any page placed in a namespace. On English Wikipedia, this also has the effect of removing other page types, such as media and image descriptions, help pages copied from MetaWiki, front-page portal components, and templates. As this stage also required decompressing the data, it ran over the course of several days on a multi-processor server.

We took a random subset of the data for processing. Samples of 0.04% and 0.01% of pages (approximately: see the description of the subset tool below; actual page counts 14,215 and 3,589 respectively) were selected, yielding a compressed dataset which would fit on a CD-ROM, and could be processed in a reasonable timeframe. Further iterations of the experiment may study larger subsets of the data.

We performed categorisation on the revisions, into several edit types which could be automatically distinguished. In particular, a simple equality comparison between a revision and the revision two edits previous can detect the most common (anti-)abuse modification: the rollback, or revert (unfortunately, MediaWiki does not record such operations semantically). A sequence of reverts (e.g. http://en.wikipedia.org/w/index.php?title=Anarchism&diff=next&oldid=320139) is usually indicative of an 'edit war', where two users continually undo each other's changes in favour of their own. Page blanking was also easy to detect, but identifying more complicated forms of vandalism (e.g. misinformation, spam) was not feasible: if reliable, automatic detection were possible, such changes would not be present in the data, as Wikipedia could prevent them from being applied. Identifying abuse (and abuse management) of the simpler types is important, as otherwise they would appear as very large changes.

In order to detect changes in the text content, templates used, MediaWiki categories, and links from a page, it was necessary to attempt to parse the MediaWiki markup format. Such 'wikitext', as it is known, is not a formally defined language: there is no grammar for it, and it does not appear likely that an unambiguous grammar actually exists. MediaWiki does not have a parser in the same way as processing tools such as compilers and XML libraries; instead it has a long and complicated set of text substitution procedures which convert parts of 'wikitext' into display-oriented HTML. These substitutions often interact in an ill-defined manner, generally resulting either in yet more special-case substitutions, or in the interaction being adopted as a new, hybrid feature, which editors then use. Because of these problems, and the lack of abstraction in MediaWiki's 'parser', as much as the programming language boundary, a 'scraping' parser was created which attempted to approximate partial processing of the wikitext format and return mostly correct results. This parser is a single-pass state machine (42 states) with a few additional side-effects. This yields excellent performance: testing showed that the time spent parsing is dominated by the time performing decompression.

To determine if an edit included a significant ('major') change to the text content, we required a difference metric between the plaintext of the revisions. This metric was then compared to a threshold to classify edits as being content changes or not (in particular, the imperfect parser generates 'noise' from some non-content changes, as it cannot correctly remove all the markup). The default threshold was chosen as 5%: sentences in the English language are generally around twenty words in length, so this considers anything up to changing one word in each sentence as non-major (minor). MediaWiki also allows registered users to explicitly state that an edit is minor; this flag was respected where present.
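As a rough illustration of the two simplest checks described above, the sketch below detects a revert by comparing a revision's text with the text from two revisions earlier, and applies the 5% threshold for 'major' content changes. The function names and types are ours for illustration only; they are not taken from the actual categorise tool.

    #include <stddef.h>
    #include <string.h>

    /* A revision is treated as a revert if its plaintext is byte-identical to
     * that of the revision two edits previous, i.e. it undoes the edit in
     * between. (Hypothetical helper, not from the published tools.) */
    static int is_revert(const char *text, const char *text_two_back)
    {
        return text_two_back != NULL && strcmp(text, text_two_back) == 0;
    }

    /* An edit counts as 'major' if the plaintext distance exceeds 5% of the
     * revision length and the editor did not flag the edit as minor.
     * (The choice of length as the baseline is our assumption.) */
    static int is_major(size_t distance, size_t text_length, int flagged_minor)
    {
        if (flagged_minor)
            return 0;
        return distance * 20 > text_length;  /* distance > 5% of length */
    }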
[Figure 1: Data flow of Wikipedia experiment. Downloaded dump (read via bzcat) → trim to articles → trimmed dump (gzip) → random subset → subset (gzip); the subset is then read (zcat) for nearby revision comparison and WikiMarkup parsing, feeding abuse, content, template, category, and link detection, followed by categorisation, aggregation, and statistics.]
We chose an approximation of Levenshtein distance [7], as it is a simple measure of insertions, deletions, and substitutions, fitting the kind of edit operations performed on the wiki. However, the algorithm for computing Levenshtein distance itself was far too time-complex, even with aggressive optimisation, taking two minutes on a tiny test set of just a few thousand revisions of a single page (before trimming away the identical parts at either end of both strings to take advantage of edit locality, this took 45 minutes). The problem was that the matrix-based approach is O(n × m), where n and m are the string lengths, in all cases: for n and m in the region of 16,000 characters, as found on many revisions, merely iterating through all 256 million matrix cells was prohibitively expensive.

Instead, we developed a new approach to computing such a distance, taking advantage of the domain-specific knowledge that the two strings being compared are likely very similar save for 'local' edits: the difference is likely to be a new paragraph, or a removed sentence, or some changed punctuation. Instead of an efficient search within the space of editing operations, as Levenshtein performs, it is based on the idea of "sliding windows": a pass is made over both strings in parallel; when characters begin to differ, a look-back 'window' is opened between the point at which differences began, and continues until similarity is again found between these windows. At this point, the position through the strings resynchronises, the distance is increased by the offset required, and the windows are again 'closed'. When the end of either string is reached by the far edge of the window, the algorithm can terminate, as any remaining characters in the other string must be unmatched and thus add to the distance. As a result, the algorithm scales with regard to the shorter of the two strings, which is helpful when revisions may add whole paragraphs of new text to the end. To reduce inaccuracy in certain cases, the algorithm maintains a 'processed point' cursor, to avoid double-counting of overlapping insertions and deletions. Pseudocode is presented as algorithm 1, which works on a pair of string buffers; upstr.c in the tool source contains a C implementation. This approach is still O(n × m) worst-case, but is O(n) (where n is the shorter string) for identical strings, and degrades smoothly as contiguous differences increase in size: instead of two minutes, the tiny test set was compared in a little over ten seconds.

Unfortunately, changes such as 'ABCF' to 'ADCDBCF' can return overestimates, as the localisation which explicitly prevents full lookback (and keeps computational cost below O(n²)) causes the 'C' in 'BCF' to match with the 'C' in 'DCD': 'ADC' is considered a substitution of 'ABC' before the algorithm can realise that 'BC' is still intact in the string, and 'DCD' is merely an insertion. As a result, the later 'B' is considered an insertion, as it no longer matches anything, and the distance is overestimated by one. Synthetic tests showed this overestimation to be minor; tests against Levenshtein on a tiny subset of Wikipedia data (a node's first few hundred revisions, thus under heavy editing) show it to be larger, with errors in the tens, and a peak error of over two hundred. The reason for such large errors is unclear, as the resynchronisation approach should also keep error localised, but it does not greatly affect the result for the purpose of minor/major determination: the majority of changes were correctly classified.

Identifying changes to links, etc. was significantly simpler, and merely required comparing the sets of links identified by the parser across two revisions (a minimal sketch of such a comparison appears after the groupings list below). These categorisations yielded simple information on which kinds of changes were made by each revision, and removed much of the 'bulk' of the dataset (the revision texts); as a result, simple scripts could then handle the data to aggregate it into various groupings in memory, so as to produce graph data and statistics for analysis. Gnuplot (http://www.gnuplot.info/) was used to plot the graph data into graphics as part of the build process for this paper.

We identified the following non-mutually-exclusive groupings to usefully categorise edits:

Revert: Edit which simply undoes a previous edit.

Content: Major (nontrivial) edit of the page content.

Minor: Minor (trivial) edit of the page content.

Category: Edit to the categories of a page.

List of: Edit to a page which is an index to other pages.

Indexing: Edit to categories or listings, possibly both.

Template: Edit to the templates used by a page.

Page link: Edit to an internal page link.

URL link: Edit to a WWW URL link; usually external.

Links: Edit to page or URL links.

Link only: As 'Links', but excluding major edits.

Hyperstructure: Any hypermedia change: indexing, linking, or template.

We expand upon the definition and significance of these groups as needed in section 6.
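The link-set comparison mentioned before the groupings list is straightforward; a minimal sketch, assuming the parser has already produced sorted, de-duplicated arrays of link targets for each revision, might look like the following. The function is hypothetical and not taken from the tool source.

    #include <stddef.h>
    #include <string.h>

    /* Returns nonzero if the two sets of link targets differ, i.e. the
     * revision added or removed at least one link. Both arrays are assumed
     * to be sorted and de-duplicated by the (hypothetical) caller. */
    static int link_sets_differ(char **old_links, size_t n_old,
                                char **new_links, size_t n_new)
    {
        size_t i;
        if (n_old != n_new)
            return 1;
        for (i = 0; i < n_old; i++)
            if (strcmp(old_links[i], new_links[i]) != 0)
                return 1;
        return 0;
    }

The same comparison, applied separately to page links, URL links, categories, and templates, is enough to populate the groupings listed above.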
Algorithm 1: 'Sliding window' string distance metric

    procedure String-Distance(A, B)
        proc ← 0                         ▷ no. of characters of string processed
        procstr ← neither                ▷ last string aligned upon
        dist ← 0                         ▷ difference accumulator
        nearA ← farA ← A                 ▷ near and far pointers
        nearB ← farB ← B
        let endA be the beyond-last character of buffer A, and endB beyond B

        procedure Scan(near, far)
            for scan ← near to before far do
                if characters at scan and far are the same then
                    return scan
            return false

        repeat
            synfarA ← Scan(nearA, farA)
            synfarB ← Scan(nearB, farB)
            if synfarA ∧ synfarB then             ▷ missed alignment
                if synfarA is further into A than synfarB is into B then
                    farA ← synfarA
                else
                    farB ← synfarB
            else if synfarA then
                farA ← synfarA
            else if synfarB then
                farB ← synfarB
            if characters at farA and farB are the same then
                ▷ aligned; calculate nears after processed point
                enA ← Min(nearA, A + proc − 1)
                enB ← Min(nearB, B + proc − 1)
                ▷ unaligned lengths
                unA ← positive distance from enA to farA
                unB ← positive distance from enB to farB

                procedure Align(un, far, buffer, other)
                    dist ← dist + un
                    proc ← far's distance into buffer
                    if procstr = other then
                        proc ← proc + 1
                    procstr ← buffer

                if unA > unB then
                    Align(unA, farA, A, B)
                else
                    Align(unB, farB, B, A)
                if farA = endA then               ▷ ending
                    dist ← dist + distance between farB and endB
                else if farB = endB then
                    dist ← dist + distance between farA and endA
                else                              ▷ advanced with closed window
                    nearA ← farA ← farA + 1
                    nearB ← farB ← farB + 1
                    proc ← proc + 1
            else                                  ▷ not aligned; widen windows
                if farA ≠ endA then
                    farA ← farA + 1
                if farB ≠ endB then
                    farB ← farB + 1
        until farA = endA ∨ farB = endB
        return dist

5. TOOLS DEVELOPED

To process the sizable dataset, we created a set of small, robust, stream-based tools in C. Stream-based processing was a necessity, as manipulating the entire data in memory at once was simply infeasible; instead, the tools are intended to be combined arbitrarily using pipes. We used standard compression tools to de- and re-compress the data for storage on disk, else the verbosity of the XML format caused processing to be heavily I/O-bound (specifically, GNU Zip for intermediate storage; bzip2, as originally used by Wikipedia, made processing heavily CPU-bound). The open source Libxml2 library (http://xmlsoft.org/) was used to parse and regenerate the XML via its SAX interface. A selection of the more notable tools:

dumptitles: Converts a MediaWiki XML dump (henceforth, "MWXML") into a plain, newline-separated list of page titles. Useful for diagnostics, e.g. confirming that the random subset contains an appropriate range of pages.

discardnonart: Reads in MWXML, and outputs MWXML, sans any pages which are in a namespace; pedantically, due to the poor semantics of MWXML, those with colons in the title. This implements the "trim to articles" step of figure 1.

randomsubset: Reads and writes MWXML, preserving a random subset of the input pages. In order for this to be O(1) in memory consumption, it does not strictly provide a given proportion of the input; instead, the control is the probability of including a given page in the output (a sketch of this selection appears at the end of this section). As a result, asking for 50% of the input may actually yield anywhere between none and all of the pages: it is just far more likely that the output will be around 50% of the input. A better algorithm, which is O(1) with regard to total data size but O(n) with regard to subset size, is to store a buffer of up to n pages, and probabilistically replace them with different pages as they are encountered. However, even this would be prohibitively memory-intensive on statistically significant subset sizes, as each page may have thousands of revisions, each with thousands of bytes of text, all of which must be copied into the buffer.

categorise: Reads MWXML and categorises the revisions, outputting results to a simple XML format.

cataggr: A Perl script which processes the categorisation XML to produce final statistical results and graph data. By this point, the data are small enough that a SAX parser is used to build a custom in-memory document tree, such that manipulation is easier.

The tools are available under the open source MIT license, and can be retrieved from http://users.ecs.soton.ac.uk/prb/phd/wikipedia/ to recreate the experiment.
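As noted in the description of randomsubset, the memory-constant behaviour comes from deciding each page independently as it streams past. A minimal sketch of that decision, under our own assumptions rather than the published source, is:

    #include <stdlib.h>

    /* Keep a page with probability p (0.0 to 1.0), decided independently per
     * page, so memory use is O(1) but the output size only has an expected
     * value of p times the input size. (Illustrative only.) */
    static int keep_page(double p)
    {
        return ((double)rand() / ((double)RAND_MAX + 1.0)) < p;
    }

A run would seed the generator once (e.g. with srand) before streaming the dump, and call keep_page for each page element encountered.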
6. RESULTS

Because of the known error margin of the approximation of Levenshtein distance, we computed results from both genuine and approximated distances on the 0.01% subset, so as to discover and illustrate the effects of approximation; the computational cost difference between the algorithms was significant: two-and-a-half hours for genuine, eight minutes for approximated. Results were then generated from the more statistically significant 0.04% subset (27 hours). This latter subset contained some pages on contentious topics, which had seen large numbers of revisions as a result.

    (a) 0.01% subset            (b) 0.04% subset
    Edit type   Proportion      Edit type   Proportion
    Categories  8.71%           Categories  8.75%
    Lists       1.97%           Lists       3.72%
    Overhead    10.56%          Overhead    12.34%

Table 1: Proportions of edits related to index management

    Category        Registered  Unregistered   Total
    List of              1,146           453   1,599
    Revert               4,069           679   4,748
    Category             6,121           954   7,075
    URL link             5,548         2,977   8,525
    Indexing             7,174         1,397   8,571
    Template             7,992         1,330   9,322
    Content             10,275         4,182  14,457
    Minor               13,776         9,961  23,737
    Link only           20,969         7,877  28,846
    Page link           27,205         8,871  36,076
    Links               29,671        10,606  40,277
    Hyperstructure      38,358        11,701  50,059
    Total               57,463        23,733  81,196

Table 3: Categorisation of edits for 0.01% subset, Levenshtein

6.1 Index management

Table 1 shows the proportions of edits in categories pertaining to index management. "Categories" are changes to the categories in which a page was placed. "Lists" are any change to any 'List of' page; these pages serve as manually-maintained indices to other pages. "Overhead" are changes which fall into either of these categories: because they are not mutually exclusive (lists may be categorised), it is not a sum of the other two values. Because these metrics do not consider the change in 'content' magnitude of a change, they are unaffected by the choice of string distance algorithm.

The ten percent overhead shows a strong case for the need for stronger semantics and querying on Wikipedia; this is one of the key goals, and expected benefits, of the Semantic MediaWiki project. While virtually every 'List of' node could be replaced with a query on appropriate attributes, the gain in category efficiency is harder to measure. Any semantic wiki must still be provided with categorisation metadata such that the type of pages can be used to answer such queries. However, some improvement is to be expected, as there are current Wikipedia categories which could be inferred: either because they are a union of other categories (e.g. 'Free software' and 'Operating systems' cover the existing category 'Free software operating systems'), or because they are implied by a more specialised category, and no longer need to be explicitly applied to a page.

The increase in list overhead seen in the larger subset is likely a result of having a more representative proportion of 'List of' pages. Otherwise, the results are largely consistent across sample sizes.

6.2 Link management

Table 2 shows categories related to the management of links. "Links" refers to edits which changed either page-to-page or page-to-URL links. "Links only" refers to such edits excluding those which also constituted a 'major' content change: they are edits concerned only with links and other structure. "Hyperstructure" is the category of edits which changed any of the navigational capabilities of the wiki: either categories, 'List of' pages, links, or templates. "Content" is simply the category of 'major' edits.

The overestimating effect of the approximate string distance algorithm can be seen as a greater proportion of edits being considered 'major', with a knock-on effect of reducing the ratios of the other edit types over content edits. However, the results are consistent between the 0.01% subset with the approximated string distance and the sample set four times the size. As a result, it would appear that the smaller size of the sample set has not introduced significant error in this case, and it is reasonable to assume that a Levenshtein distance comparison of the larger dataset would yield similar results to the 0.01% subset. Therefore, further discussion will focus on the 0.01% subset with Levenshtein distance results.

These figures show the significance of hyperstructure to Wikipedia, to a surprising degree. While we expected that link editing would prove a substantial proportion of edits compared to content, we did not anticipate that twice as many edits would change links alone as change content. Most link changes were page links (those to other pages on the wiki, or metawiki), as opposed to URL links to arbitrary webpages (in some cases, pages on the wiki with special arguments): 36,076 edits modified the former, but only 8,525 the latter.

With such a proportion of editing effort being expended on modifying links on Wikipedia, there is a clear need to improve this process. Introducing richer hypermedia features to wikis, such as generic links, should prove one possible improvement. Generic links are links whose endpoints are defined by matching on criteria of the document content: a basic example being matching on a particular substring. A generic link can specify that a page's title should link to that page, rather than requiring users to manually annotate it: some early wiki systems offered this capability, but only for page titles which were written in the unnatural 'CamelCase' capitalisation. Advanced examples such as local links, present in Microcosm [3, 4], can specify scope limits on the matching. This would help with ambiguous terms on Wikipedia, such as 'Interval', which should be linked to a specific meaning, such as 'Interval (music)'.
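To make the idea of generic links more concrete, the following sketch (entirely hypothetical, and far cruder than Microcosm's matching) scans a page's plaintext for another page's title and reports each occurrence as a candidate link anchor; a real implementation would need word-boundary, case, and scope handling of the kind discussed above.

    #include <stdio.h>
    #include <string.h>

    /* Report each occurrence of 'title' in 'text' as a candidate anchor for
     * a generic link to that page. Offsets are byte positions in the text.
     * (Illustrative only; not part of the experiment's tools.) */
    static void report_generic_link_anchors(const char *text, const char *title)
    {
        const char *p = text;
        size_t len = strlen(title);
        if (len == 0)
            return;
        while ((p = strstr(p, title)) != NULL) {
            printf("anchor at offset %ld -> \"%s\"\n", (long)(p - text), title);
            p += len;
        }
    }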
                    (a) 0.01% subset     (b) 0.01% subset     (c) 0.04% subset
                    Levenshtein          Approximated         Approximated
    Edit type       Proportion           Proportion           Proportion
    Links           49.60%               49.60%               49.56%
    Links only      35.53%               23.36%               25.24%
    Hyperstructure  61.65%               61.65%               61.90%
    Content         17.81%               35.60%               35.99%

    Edit type       Ratio over content   Ratio over content   Ratio over content
    Links           2.79                 1.39                 1.38
    Links only      2.00                 0.71                 0.70
    Hyperstructure  3.46                 1.73                 1.72

Table 2: Proportions of edits related to link management

6.3 Overall editing distribution

Table 3 shows the categorisation of all edits in the 0.01% dataset, using Levenshtein for string distance, for registered and unregistered users. Note that the edit categories are not mutually exclusive, and thus will not sum to the total number of edits by that class of user. "Minor" is the category of edits which did not appear to change anything substantial: either the information extracted from the markup remains the same and the plaintext very similar, or a registered user annotated the edit as minor. Notably, over 5% of edits are reverts: edits completely rolling back the previous edit; this implies that a further 5% of edits are being reverted, presumably as they are deemed unsuitable (actual figures may vary in either direction: this does not detect rollbacks to versions earlier than the immediately preceding version, and 'edit wars' of consecutive rollbacks will be entirely included in the first 5%, not the latter). A substantial amount of effort is being expended merely keeping Wikipedia 'stationary'.

Figure 2 demonstrates the distribution of users over the total number of edits they have made, in the vein of the Swartz study [10]. There is a sharp falloff in the number of users as the number of edits increases (note the logarithmic scale on both axes): by far, most users only ever make very few edits, whether registered or not. Unsurprisingly, registered users tend to make more edits overall, and unregistered users are dominant at the scale of fewer than ten edits.

Figure 3 breaks the low-edit end of this distribution down by basic categories. It is interesting to note that, other than being in close proximity (e.g. "content" and "page link"), the lines do not have any definitive overlaps: the breakdown of edits is consistent regardless of the number of edits the user has made. Users who have made 70 edits have made edits in the same relative proportions (i.e., more "revert" than "list of") as those who have only made five.

Figure 4 shows how the magnitude of edits breaks down by the number of edits of that magnitude, again in the vein of Swartz [10]. Because this is clearly sensitive to the string distancing algorithm, the 0.01% subset was used, with a focus on Levenshtein: the approximate distance for all users is shown as a sparsely dotted line with a consistent overestimate. These results are largely unsurprising: registered users make larger edits, and most edits are small, with the count rapidly falling off as magnitude increases.

6.4 Limitations of detection

There are, unfortunately, several kinds of 'overhead' costs which simply cannot be detected in a computationally feasible manner by this approach. For example, MediaWiki supports a feature called template 'substitution', which actually imports the template, with parameter substitution performed (with some caveats), into the source text of the including node. It is important to note that the relationship between the including and included nodes is lost, and that the benefits of re-use (such as storage efficiency and later corrections) are not available. The information regarding the origin of the text is also lost without manual documentation effort, including any parameters required for the more complicated templates. Because use of this feature is not semantically recorded by MediaWiki, it is largely indistinguishable from the addition of a paragraph of wikitext. As a result, it is not then possible to evaluate the cost of maintaining or documenting these substitutions once the link to the original template has been lost.

It is also not computationally feasible to detect the pattern of a user performing the same fix on multiple pages, which would identify the cost of inadequate, or underused, transclusion. Transclusion is an inclusion-by-reference mechanism, where a selected (fragment of a) document is included 'live' into another, greatly facilitating re-use.

In Wikipedia, it is often desirable to accompany a link to a page with a short summary of that page's topic. In particular, Wikipedia has many cases where articles include a summary of another article, along with a "main article" link. The 'London' page (http://en.wikipedia.org/w/index.php?title=London&oldid=155695080), for example, has many sections which consist primarily of summaries of more detailed pages, such as 'Education in London'. However, without some form of transclusion or composition to share text, if the main article's summary changes (possibly because its subject changes), this change must be replicated manually out to any page which also summarises it. A transclusion mechanism would allow a single summary of the subject to be shared by all pages which reference it, including the main article on the subject, if desired.

For example, the 'Education in London' page may begin with a summary of its topic, highlighting the most notable institutions and successful research areas. The article on 'London' may then, within its 'Education' section, transclude this summary from the 'Education in London' page. Should the summary be updated, perhaps because a university gains significant notability in a new research area, this change would be automatically reflected in the 'London' page, as it is using the same text.

While MediaWiki's templates do function as transclusion, they are not employed for this role: common usage and development effort focus on their use as preprocessing macros.
[Figure 2: User distribution over total number of edits made; 0.04% subset. Number of users against edit count ±5, both on logarithmic scales, for all users, registered users, and unregistered users.]

[Figure 3: User distribution over total number of edits made, by category; 0.04% subset. Number of users (log scale) against edit count ±5, broken down by content, page link, URL link, template, category, revert, and 'list of' edits.]

[Figure 4: Edit distribution over magnitude of edit; 0.01% subset. Number of edits (log scale) against edit magnitude ±50, for all users, registered users, unregistered users, and all users with the approximated distance.]
7. CONCLUSIONS

The experiment consisted of the non-exclusive classification of edits made throughout the history of Wikipedia, a large and public wiki system. Classifications included both the areas of "text editing" (assumed to be primarily maintaining the information content of Wikipedia: its encyclopædic articles), and "link editing" (maintaining the navigational structure of the content). The hypothesis, that link editing formed a substantial proportion of total editing effort, and may potentially be automated, was supported by the results. Twice as many edits changed links alone, without affecting the article text, as changed content. Edits which maintained manual indexes of pages constituted approximately a tenth of total edits.

We are continuing this work with a more detailed, small-scale experiment, to better understand the patterns of real-world wiki editing. It is being treated as a knowledge elicitation task, to gather information on the mental processes behind wiki editing: information on the tasks editors set themselves, and how their actions are used to achieve them. Our long-term goal is to continue this research by means of the development and evaluation of a prototype system, informed by these studies, which can be used to test the hypothesis that increased hypermedia features actually result in benefits such as a decrease in editing overhead.

8. REFERENCES

[1] T. Berners-Lee, R. Cailliau, J.-F. Groff, and B. Pollermann. World-Wide Web: The Information Universe. Electronic Networking: Research, Applications and Policy, 1(2):74–82, 1992.
[2] V. Bush. As We May Think. The Atlantic Monthly, 176:101–108, Jul 1945.
[3] H. Davis, W. Hall, I. Heath, G. Hill, and R. Wilkins. Towards an integrated information environment with open hypermedia systems. In ECHT '92: Proceedings of the ACM Conference on Hypertext, pages 181–190, New York, NY, USA, 1992. ACM Press.
[4] A. M. Fountain, W. Hall, I. Heath, and H. Davis. MICROCOSM: An open model for hypermedia with dynamic linking. In European Conference on Hypertext, pages 298–311, 1990.
[5] F. Halasz and M. Schwartz. The Dexter hypertext reference model. Communications of the ACM, 37(2):30–39, 1994.
[6] M. Krötzsch, D. Vrandečić, and M. Völkel. Wikipedia and the Semantic Web: the missing links. In Proceedings of WikiMania 2005, 2005. Online at http://www.aifb.uni-karlsruhe.de/WBS/mak/pub/wikimania.pdf.
[7] V. Levenshtein. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 10:707, Feb 1966.
[8] F. Manola and E. Miller. RDF Primer. Technical report, W3C, Feb 2004.
[9] T. Nelson. Literary Machines. Mindful Press, Sausalito, California, 93.1 edition, 1993.
[10] A. Swartz. Who Writes Wikipedia? Online at http://www.aaronsw.com/weblog/whowriteswikipedia, Sep 2006.