Hyperstructure Maintenance Costs in Large-scale Wikis

Philip Boulain (prb@ecs.soton.ac.uk), Nigel Shadbolt (nrs@ecs.soton.ac.uk), Nicholas Gibbins (nmg@ecs.soton.ac.uk)
Intelligence, Agents, Multimedia Group
School of Electronics and Computer Science, University of Southampton
Southampton SO17 1BJ, United Kingdom

ABSTRACT
Wiki systems have developed over the past years as lightweight, community-editable, web-based hypertext systems. With the emergence of Semantic Wikis, these collections of interlinked documents have also gained a dual role as ad-hoc RDF [8] graphs. However, their roots lie in the limited hypertext capabilities of the World Wide Web [1]: embedded links, without support for composite objects or transclusion. In this paper, we present experimental evidence that hyperstructure changes, as opposed to content changes, form a substantial proportion of editing effort on a large-scale wiki. The experiment is set in the wider context of a study of how the technologies developed during decades of hypertext research may be applied to improve the management of wiki document structure and, with semantic wikis, knowledge structure.

Categories and Subject Descriptors
H.3.5 [Information Systems]: Information Storage and Retrieval: Online Information Services; H.5.3 [Information Systems]: Information Interfaces and Presentation: Group and Organization Interfaces; H.5.4 [Information Systems]: Information Interfaces and Presentation: Hypertext/Hypermedia

General Terms
Experimentation

Keywords
Hypertext, Semantic Web, Wiki, Web Science

Copyright is held by the authors. Copyright transferred for publishing online and on a conference CD-ROM. SWKM 2008: Workshop on Social Web and Knowledge Management @ WWW 2008, April 22, 2008, Beijing, China.
1. INTRODUCTION
This experiment forms part of a broader project looking into the potentially beneficial relationships between open hypermedia, the study of interconnected documents; the Semantic Web, the study of interconnectable data; and 'wikis', web-based communal editing systems.

Hypermedia is a long-standing field of research into the ways in which documents can expand beyond the limitations of paper, generally in terms of greater cross-referencing and composition (reuse) capability. Bush's As We May Think [2] introduces the hypothetical early hypertext machine, the 'memex', and defines its "essential feature" as "the process of tying two items together". This linking between documents is the common feature of hypertext systems, upon which other improvements are built.

As well as simple binary (two-endpoint) links, hypertext systems have been developed with features including n-ary links (multiple documents linked to multiple other documents), typed links (links which indicate something about why or how documents are related), generic links (links whose endpoints are determined by matching criteria of the document content, such as particular words), and composite documents, which are formed by combining a set of other, linked, documents. Open hypermedia extends this with interoperation, both with other hypermedia systems and users, and with non-hypermedia resources. A key concept in open hypermedia is the non-embedded link: links (and anchors) which are held external to the documents they connect. These allow links to be made to immutable documents, and to be added and removed in sets, often termed 'linkbases'. One of the earliest projects attempting to implement globally-distributed hypertext was Xanadu [9], a distinctive feature of whose design was transclusion: including (sections of) a document into another by reference.

In related work, we are currently investigating the relationship between an exemplar semantic wiki, Semantic MediaWiki [6], and open hypermedia systems as defined by the Dexter Hypertext Reference Model [5]. Our preliminary results, based on a formal description of Semantic MediaWiki in terms of the Dexter model, suggest that such semantic wikis can be treated as simple open hypermedia systems. While the details are beyond the scope of this paper, some basic parallels are evident: a wiki node is akin to both a hypermedia document and a Semantic Web resource. Semantic wikis generally treat typed inter-node links as RDF statements relating the nodes, and these links are embedded and binary in hypermedia terms. From this we can see a meaningful similarity between a graph of documents connected by typed links and a graph of resources connected by RDF statements. We can also see that wikis lack the more advanced hypermedia link features, such as links which are not embedded, or which have more than two endpoints. This suggests that semantic wikis stand to gain from techniques developed within hypermedia, but we must first judge whether there is any substantial cost to be reduced. Hence we have performed a quantitative experiment on a large-scale public wiki system to measure the proportion of effort expended on hyperstructure-related activities, as opposed to editing the document content.

2. HYPOTHESIS
We carried out an experiment to estimate the proportion of effort expended maintaining the infrastructure around data, rather than the data itself, on a weak hypertext wiki system. We define a 'weak' hypertext system here as one whose feature set is limited to embedded, unidirectional, binary links, as with the World Wide Web. Our hypothesis is that the manual editing of link structure, of a type which richer hypertext features could automate, will prove to be a significant overhead compared with changes to the text content.

This experiment also seeks to partially recreate a related, informal experiment discussed in an essay by Swartz [10].
3. DATASET
We chose English Wikipedia (http://en.wikipedia.org/) as the experimental dataset, because it has both a considerably large and varied set of documents, and a complete history of the editing processes, performed by a wide range of Web users, between their first and current versions (MediaWiki, unlike many wikis, never deletes old revisions of a page). The wiki community keep the dataset fairly well inter-linked and categorised for cross-reference, but they do this via the cumulative efforts of a large body of part-time editors. As well as being statistically significant, demonstrating possible improvement of English Wikipedia is socially significant, as it is a widely-used and active resource.

It is important to stress the size of the English Wikipedia dataset. Wikipedia makes available 'dumps' of its database in an ad-hoc XML format; because this study is interested in the progression of page contents across revisions, it was necessary to use the largest of these dumps, containing both page full-text and history (and, unfortunately, also non-encyclopædic pages, such as discussions and user pages). This dump is provided compressed using the highly space-efficient (although time-complex) bzip2 algorithm; even then, it is 84.6GB. The total size of the uncompressed XML file is estimated to be in the region of two terabytes.

4. PROCEDURE
Figure 1 shows the simplified data flow of the processing of the dump performed for the experiment.

[Figure 1: Data flow of the Wikipedia experiment. The downloaded dump is trimmed to articles and reduced to a random subset (with bzcat, zcat, and gzip handling decompression and recompression between the on-disk stages); each revision then passes through wiki-markup parsing and nearby-revision comparison, is tested for abuse, content, template, category, and link changes, and the resulting categorisation is aggregated into statistics.]

First, we trimmed the dataset down to just those pages which are encyclopædic articles, as these are the pages of greatest significance to the Wikipedia project's goals, and thus the most important to study. Otherwise, the dataset would include a lot of 'noise' in the form of discussion and user pages, which are likely to have different editing patterns and to be less connected to the hyperstructure. The most practical way to do this was to remove any page placed in a namespace. On English Wikipedia, this also has the effect of removing other page types, such as media and image descriptions, help pages copied from MetaWiki, front-page portal components, and templates. As this stage also required decompressing the data, it ran over the course of several days on a multi-processor server.

We took a random subset of the data for processing. Samples of 0.04% and 0.01% of pages (approximately: see the description of the subset tool below; actual page counts 14,215 and 3,589 respectively) were selected, yielding a compressed dataset which would fit on a CD-ROM and could be processed in a reasonable timeframe. Further iterations of the experiment may study larger subsets of the data.

We categorised the revisions into several edit types which could be automatically distinguished. In particular, a simple equality comparison between a revision and the revision two edits previous can detect the most common (anti-)abuse modification: the rollback, or revert (unfortunately, MediaWiki does not record such operations semantically). A sequence of reverts is usually indicative of an 'edit war', where two users continually undo each other's changes in favour of their own (e.g. http://en.wikipedia.org/w/index.php?title=Anarchism&diff=next&oldid=320139). Page blanking was also easy to detect, but identifying more complicated forms of vandalism (e.g. misinformation, spam) was not feasible: if reliable, automatic detection were possible, such edits would not be present in the data, as Wikipedia could prevent them from being applied. Identifying abuse (and abuse management) of the simpler types is important, as otherwise these edits would appear as very large changes.
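A minimal sketch of this check, assuming the revision texts of a page are available in order as strings (struct revision and flag_reverts are illustrative names; the actual categorise tool works as a stream over the dump XML):

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Illustrative revision record: only the plain text matters here. */
    struct revision {
        const char *text;
    };

    /* Mark which revisions in a page's history are reverts, by comparing each
     * revision with the one two edits previous. MediaWiki does not record
     * rollbacks semantically, so textual equality is the only signal. A run
     * of set flags suggests an 'edit war'; an empty text is a page blanking. */
    static void flag_reverts(const struct revision *revs, size_t n, bool *revert)
    {
        for (size_t i = 0; i < n; i++)
            revert[i] = i >= 2 && strcmp(revs[i].text, revs[i - 2].text) == 0;
    }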
In order to detect changes in the text content, templates used, MediaWiki categories, and links from a page, it was necessary to attempt to parse the MediaWiki markup format. This 'wikitext', as it is known, is not a formally defined language: there is no grammar for it, and it does not appear likely that an unambiguous grammar actually exists. MediaWiki does not have a parser in the sense used by processing tools such as compilers and XML libraries; instead it has a long and complicated set of text substitution procedures which convert parts of wikitext into display-oriented HTML. These substitutions often interact in an ill-defined manner, generally resulting either in more special-case substitutions, or in the interaction being defined as a new, hybrid feature, which editors then use. Because of these problems, and the lack of abstraction in MediaWiki's 'parser', as much as the programming language boundary, we created a 'scraping' parser which attempts to approximate partial processing of the wikitext format and return mostly correct results. This parser is a single-pass state machine (42 states) with a few additional side-effects. This yields excellent performance: testing showed that the time spent parsing is dominated by the time spent on decompression.
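For illustration of the scraping style of parsing (this is not the actual 42-state machine), the following sketch extracts internal page-link targets of the form [[Target]] or [[Target|label]]; categories ([[Category:...]]) and templates ({{...}}) are handled by the real parser with further states. Function names and buffer sizes are illustrative:

    #include <stdio.h>

    enum state { TEXT, OPEN1, IN_LINK, IN_TAIL, CLOSE1 };

    /* Toy single-pass scanner for [[...]] page links in wikitext. */
    static void scan_page_links(const char *wikitext)
    {
        enum state st = TEXT;
        char target[256];
        size_t len = 0;

        for (const char *p = wikitext; *p != '\0'; p++) {
            switch (st) {
            case TEXT:
                st = (*p == '[') ? OPEN1 : TEXT;
                break;
            case OPEN1:                      /* saw one '[' */
                if (*p == '[') { st = IN_LINK; len = 0; }
                else st = TEXT;
                break;
            case IN_LINK:                    /* collecting the link target */
                if (*p == ']') {
                    target[len] = '\0';
                    printf("page link: %s\n", target);
                    st = CLOSE1;
                } else if (*p == '|') {      /* display label follows; target complete */
                    target[len] = '\0';
                    printf("page link: %s\n", target);
                    st = IN_TAIL;
                } else if (len < sizeof target - 1) {
                    target[len++] = *p;
                }
                break;
            case IN_TAIL:                    /* skipping the display label */
                if (*p == ']') st = CLOSE1;
                break;
            case CLOSE1:                     /* saw one ']'; expect another */
                st = (*p == ']') ? TEXT : IN_TAIL;
                break;
            }
        }
    }

Comparing the sets of targets extracted from two adjacent revisions then gives the link-change categorisation described below.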
To determine whether an edit included a significant ('major') change to the text content, we required a difference metric between the plaintext of the revisions. This metric was then compared to a threshold to classify edits as content changes or not (in particular, the imperfect parser generates 'noise' from some non-content changes, as it cannot correctly remove all of the markup). The default threshold was chosen as 5%: sentences in the English language are generally around twenty words in length, so this considers anything up to changing one word in each sentence as non-major (minor). MediaWiki also allows registered users to explicitly state that an edit is minor; this flag was respected where present.

We chose an approximation of Levenshtein distance [7], as it is a simple measure of insertions, deletions, and substitutions, fitting the kind of edit operations performed on the wiki. However, computing the Levenshtein distance itself was far too time-complex, even with aggressive optimisation, taking two minutes on a tiny test set of just a few thousand revisions of a single page (before trimming away the identical parts at either end of both strings to take advantage of edit locality, this took 45 minutes). The problem is that the matrix-based approach is O(n × m), where n and m are the string lengths, in all cases: for n and m in the region of 16,000 characters, as found in many revisions, merely iterating through all 256 million matrix cells was prohibitively expensive.
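A sketch of the classification step, assuming some string distance function is supplied, and assuming the 5% threshold is taken relative to the length of the longer revision (the normalisation is not spelled out above); the prefix/suffix trimming just mentioned is included, and all names are illustrative:

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Strip the common prefix and suffix of two revisions before measuring the
     * distance; edits are usually local, so this shrinks the problem greatly. */
    static void trim_common(const char **a, size_t *alen, const char **b, size_t *blen)
    {
        while (*alen > 0 && *blen > 0 && (*a)[0] == (*b)[0]) {
            (*a)++; (*b)++; (*alen)--; (*blen)--;
        }
        while (*alen > 0 && *blen > 0 && (*a)[*alen - 1] == (*b)[*blen - 1]) {
            (*alen)--; (*blen)--;
        }
    }

    /* Classify an edit as 'major' if the plaintext distance exceeds 5% of the
     * longer revision's length, unless the editor flagged it as minor. */
    static bool is_major(const char *oldtext, const char *newtext, bool flagged_minor,
                         size_t (*distance)(const char *, size_t, const char *, size_t))
    {
        size_t alen = strlen(oldtext), blen = strlen(newtext);
        size_t longer = alen > blen ? alen : blen;

        if (flagged_minor || longer == 0)
            return false;
        trim_common(&oldtext, &alen, &newtext, &blen);
        return distance(oldtext, alen, newtext, blen) * 100 > longer * 5;
    }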
Instead of the full Levenshtein computation, we therefore developed a new approach to computing such a distance, taking advantage of the domain-specific knowledge that the two strings being compared are likely to be very similar save for 'local' edits: the difference is likely to be a new paragraph, a removed sentence, or some changed punctuation. Rather than searching efficiently within the space of editing operations, as Levenshtein does, it is based on the idea of 'sliding windows': a pass is made over both strings in parallel; when characters begin to differ, a look-back 'window' is opened at the point at which the differences began, and widens until similarity is again found between the two windows. At this point, the position through the strings resynchronises, the distance is increased by the offset required, and the windows are again 'closed'. When the end of either string is reached by the far edge of the window, the algorithm can terminate, as any remaining characters in the other string must be unmatched and thus add to the distance. As a result, the algorithm scales with regard to the shorter of the two strings, which is helpful when revisions may add whole paragraphs of new text at the end. To reduce inaccuracy in certain cases, the algorithm maintains a 'processed point' cursor, to avoid double-counting of overlapping insertions and deletions. Pseudocode is presented as Algorithm 1, which works on a pair of string buffers; str.c in the tool source contains a C implementation. This approach is still O(n × m) in the worst case, but is O(n) (where n is the shorter string) for identical strings, and degrades smoothly as contiguous differences increase in size: instead of two minutes, the tiny test set was compared in a little over ten seconds.

Algorithm 1: 'Sliding window' string distance metric

    procedure String-Distance(A, B)
        proc ← 0                                ▷ number of characters of string processed
        procstr ← neither                       ▷ last string aligned upon
        dist ← 0                                ▷ difference accumulator
        nearA ← farA ← A; nearB ← farB ← B      ▷ near and far window pointers
        let endA be the beyond-last character of buffer A, and endB beyond B

        procedure Scan(near, far)
            for scan ← near to the character before far do
                if the characters at scan and far are the same then
                    return scan
            return false

        repeat
            synfarA ← Scan(nearA, farA)
            synfarB ← Scan(nearB, farB)
            if synfarA and synfarB then         ▷ both windows found a missed alignment
                if synfarA is further into A than synfarB is into B then
                    farA ← synfarA
                else
                    farB ← synfarB
            else if synfarA then
                farA ← synfarA
            else if synfarB then
                farB ← synfarB
            if the characters at farA and farB are the same then
                                                ▷ aligned; calculate nears after the processed point
                enA ← Min(nearA, A + proc − 1)
                enB ← Min(nearB, B + proc − 1)
                unA ← positive distance from enA to farA    ▷ unaligned lengths
                unB ← positive distance from enB to farB
                procedure Align(un, far, buffer, other)
                    dist ← dist + un
                    proc ← far's distance into buffer
                    if procstr = other then
                        proc ← proc + 1
                    procstr ← buffer
                if unA > unB then
                    Align(unA, farA, A, B)
                else
                    Align(unB, farB, B, A)
                if farA = endA then             ▷ ending
                    dist ← dist + the distance between farB and endB
                else if farB = endB then
                    dist ← dist + the distance between farA and endA
                else                            ▷ advanced with a closed window
                    nearA ← farA ← farA + 1
                    nearB ← farB ← farB + 1
                    proc ← proc + 1
            else                                ▷ not aligned; widen the windows
                if farA ≠ endA then
                    farA ← farA + 1
                if farB ≠ endB then
                    farB ← farB + 1
        until farA = endA or farB = endB
        return dist

Unfortunately, changes such as 'ABCF' to 'ADCDBCF' can return overestimates, as the localisation which explicitly prevents full lookback (and keeps the computational cost below O(n²)) causes the 'C' in 'BCF' to match with the 'C' in 'DCD': 'ADC' is considered a substitution of 'ABC' before the algorithm can realise that 'BC' is still intact in the string and 'DCD' is merely an insertion. As a result, the later 'B' is considered an insertion, as it no longer matches anything, and the distance is overestimated by one. Synthetic tests showed this overestimation to be minor; tests against Levenshtein on a tiny subset of Wikipedia data (a node's first few hundred revisions, thus under heavy editing) showed it to be larger, with errors in the tens and a peak error of over two hundred. The reason for such large errors is unclear, as the resynchronisation approach should also keep the error localised, but it does not greatly affect the result for the purpose of minor/major determination: the majority of changes were correctly classified.
Identifying changes to links, categories, and templates was significantly simpler, and merely required comparing the sets of links identified by the parser across two revisions. These categorisations yielded simple information on which kinds of changes were made by each revision, and removed much of the 'bulk' of the dataset (the revision texts); as a result, simple scripts could then handle the data to aggregate it into various groupings in memory, so as to produce graph data and statistics for analysis. Gnuplot (http://www.gnuplot.info/) was used to plot the graph data into graphics as part of the build process for this paper.

We identified the following non-mutually-exclusive groupings to usefully categorise edits (a sketch of one possible in-memory representation follows the list):

Revert: Edit which simply undoes a previous edit.
Content: Major (nontrivial) edit of the page content.
Minor: Minor (trivial) edit of the page content.
Category: Edit to the categories of a page.
List of: Edit to a page which is an index to other pages.
Indexing: Edit to categories or listings, possibly both.
Template: Edit to the templates used by a page.
Page link: Edit to an internal page link.
URL link: Edit to a WWW URL link; usually external.
Links: Edit to page or URL links.
Link only: As 'Links', but excluding major edits.
Hyperstructure: Any hypermedia change: indexing, linking, or templates.

We expand upon the definition and significance of these groups as needed in section 6.
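Since the groupings are not mutually exclusive, a revision can carry several at once. A minimal sketch of one possible representation as bit flags (flag names are illustrative; the derived groups are simply unions of the primitive ones):

    /* Primitive, automatically detected edit properties. */
    enum edit_flags {
        EDIT_REVERT    = 1 << 0,
        EDIT_CONTENT   = 1 << 1,   /* 'major' text change */
        EDIT_MINOR     = 1 << 2,
        EDIT_CATEGORY  = 1 << 3,
        EDIT_LIST_OF   = 1 << 4,
        EDIT_TEMPLATE  = 1 << 5,
        EDIT_PAGE_LINK = 1 << 6,
        EDIT_URL_LINK  = 1 << 7,
    };

    /* Derived groupings are unions of the primitive flags. */
    #define EDIT_INDEXING       (EDIT_CATEGORY | EDIT_LIST_OF)
    #define EDIT_LINKS          (EDIT_PAGE_LINK | EDIT_URL_LINK)
    #define EDIT_HYPERSTRUCTURE (EDIT_INDEXING | EDIT_LINKS | EDIT_TEMPLATE)

    /* 'Link only': link edits which are not also major content edits. */
    static int is_link_only(unsigned flags)
    {
        return (flags & EDIT_LINKS) && !(flags & EDIT_CONTENT);
    }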
5. TOOLS DEVELOPED
To process the sizable dataset, we created a set of small, robust, stream-based tools in C. Stream-based processing was a necessity, as manipulating the entire dataset in memory at once was simply infeasible; instead, the tools are intended to be combined arbitrarily using pipes. We used standard compression tools to decompress and recompress the data for storage on disk (specifically, GNU Zip for the intermediate files: bzip2, as originally used by Wikipedia, made processing heavily CPU-bound), as otherwise the verbosity of the XML format caused processing to be heavily I/O-bound. The open source libxml2 library (http://xmlsoft.org/) was used to parse and regenerate the XML via its SAX interface. A selection of the more notable tools:

dumptitles: Converts a MediaWiki XML dump (henceforth, "MWXML") into a plain, newline-separated list of page titles. Useful for diagnostics, e.g. confirming that the random subset contains an appropriate range of pages.

discardnonart: Reads in MWXML, and outputs MWXML sans any pages which are in a namespace; pedantically, due to the poor semantics of MWXML, those with colons in the title. This implements the "trim to articles" step of figure 1.

randomsubset: Reads and writes MWXML, preserving a random subset of the input pages. In order for this to be O(1) in memory consumption, it does not strictly provide a given proportion of the input; instead, the control is the probability of including a given page in the output. As a result, asking for 50% of the input may actually yield anywhere between none and all of the pages: it is just far more likely that the output will be around 50% of the input. (A better algorithm, which is O(1) with regard to total data size but O(n) with regard to subset size, is to store a buffer of up to n pages and probabilistically replace them with different pages as they are encountered. However, even this would be prohibitively memory-intensive at statistically significant subset sizes, as each page may have thousands of revisions, each with thousands of bytes of text, all of which must be copied into the buffer.) A sketch of the selection decision appears after this list.

categorise: Reads MWXML and categorises the revisions, outputting the results to a simple XML format.

cataggr: A Perl script which processes the categorisation XML to produce the final statistical results and graph data. By this point, the data are small enough that a SAX parser is used to build a custom in-memory document tree, such that manipulation is easier.

The tools are available under the open source MIT license, and can be retrieved from http://users.ecs.soton.ac.uk/prb/phd/wikipedia/ to recreate the experiment.
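A minimal sketch of the per-page decision made by randomsubset, assuming a caller that streams pages one at a time (the probability parameter and the use of the C library rand are illustrative; the real tool operates directly on MWXML events):

    #include <stdbool.h>
    #include <stdlib.h>

    /* Decide, independently for each page, whether to copy it to the output.
     * Memory use is O(1) because no page is ever buffered; the expected output
     * is a fraction p of the input, but the exact proportion is not guaranteed. */
    static bool include_page(double p)
    {
        return rand() < p * ((double)RAND_MAX + 1.0);
    }

In use, the caller would seed the generator once (e.g. with srand) and then forward or discard each page's XML events according to this decision.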
6. RESULTS
Because of the known error margin of the approximation of Levenshtein distance, we computed results from both genuine and approximated distances on the 0.01% subset, so as to discover and illustrate the effects of the approximation; the computational cost difference between the algorithms was significant: two and a half hours for the genuine distance, eight minutes for the approximated one. Results were then generated from the more statistically significant 0.04% subset (27 hours). This latter subset contained some pages on contentious topics, which had seen large numbers of revisions as a result.

6.1 Index management
Table 1 shows the proportions of edits in the categories pertaining to index management. "Categories" are changes to the categories in which a page was placed. "Lists" are any change to a 'List of' page; these pages serve as manually-maintained indices to other pages. "Overhead" are changes which fall into either of these categories: because they are not mutually exclusive (lists may be categorised), it is not the sum of the other two values. Because these metrics do not consider the 'content' magnitude of a change, they are unaffected by the choice of string distance algorithm.

Table 1: Proportions of edits related to index management
  Edit type      (a) 0.01% subset   (b) 0.04% subset
  Categories          8.71%              8.75%
  Lists               1.97%              3.72%
  Overhead           10.56%             12.34%

The ten percent overhead shows a strong case for the need for stronger semantics and querying on Wikipedia; this is one of the key goals, and expected benefits, of the Semantic MediaWiki project. While virtually every 'List of' node could be replaced with a query on appropriate attributes, the gain in category efficiency is harder to measure. Any semantic wiki must still be provided with categorisation metadata such that the type of pages can be used to answer such queries. However, some improvement is to be expected, as there are current Wikipedia categories which could be inferred: either because they are a union of other categories (e.g. 'Free software' and 'Operating systems' cover the existing category 'Free software operating systems') or because they are implied by a more specialised category, and so no longer need to be explicitly applied to a page.

The increase in list overhead seen in the larger subset is likely a result of having a more representative proportion of 'List of' pages. Otherwise, the results are largely consistent across sample sizes.

6.2 Link management
Table 2 shows the categories related to the management of links. "Links" refers to edits which changed either page-to-page or page-to-URL links. "Links only" refers to such edits excluding those which also constituted a 'major' content change: they are edits concerned only with links and other structure. "Hyperstructure" is the category of edits which changed any of the navigational capabilities of the wiki: categories, 'List of' pages, links, or templates. "Content" is simply the category of 'major' edits.

Table 2: Proportions of edits related to link management
  Edit type           (a) 0.01% subset,  (b) 0.01% subset,  (c) 0.04% subset,
                          Levenshtein        approximated       approximated
  Links                    49.60%             49.60%             49.56%
  Links only               35.53%             23.36%             25.24%
  Hyperstructure           61.65%             61.65%             61.90%
  Content                  17.81%             35.60%             35.99%

  Ratio over content
  Links                     2.79               1.39               1.38
  Links only                2.00               0.71               0.70
  Hyperstructure            3.46               1.73               1.72

The overestimating effect of the approximate string distance algorithm can be seen as a greater proportion of edits being considered 'major', with a knock-on effect of reducing the ratios of link edits over content edits. However, the results are consistent between the 0.01% subset with the approximated string distance and the sample set four times its size. As a result, it would appear that the smaller sample has not introduced significant error in this case, and it is reasonable to assume that a Levenshtein distance comparison of the larger dataset would yield results similar to those of the 0.01% subset. Therefore, further discussion will focus on the Levenshtein distance results for the 0.01% subset.

These figures show the significance of hyperstructure to Wikipedia, to a surprising degree. While we expected that link editing would prove a substantial proportion of edits compared to content, we did not anticipate that twice as many edits would change links alone as change content. Most link changes were page links (those to other pages on the wiki, or a metawiki) as opposed to URL links to arbitrary webpages (in some cases, pages on the wiki with special arguments): 36,076 edits modified the former, but only 8,525 the latter.

With such a proportion of editing effort being expended on modifying links on Wikipedia, there is a clear need to improve this process. Introducing richer hypermedia features to wikis, such as generic links, should prove one possible improvement. Generic links are links whose endpoints are defined by matching on criteria of the document content, a basic example being matching on a particular substring. A generic link can specify that a page's title should link to that page, rather than requiring users to annotate it manually: some early wiki systems offered this capability, but only for page titles which were written in the unnatural 'CamelCase' capitalisation. Advanced examples, such as local links, present in Microcosm [3, 4], can specify scope limits on the matching. This would help with ambiguous terms on Wikipedia, such as 'Interval', which should be linked to a specific meaning, such as 'Interval (music)'.
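As a rough sketch of the concept, rather than of Microcosm's or any particular wiki's implementation, a generic link can be viewed as a rule mapping a matched substring to a target page; applying a rule set to plain text yields link anchors without any manual annotation (names here are illustrative):

    #include <stdio.h>
    #include <string.h>

    /* A generic link: any occurrence of 'pattern' in a document becomes a link
     * to 'target'. Scope limits (as with Microcosm's local links) would be
     * additional fields restricting where a rule may apply. */
    struct generic_link {
        const char *pattern;
        const char *target;
    };

    static void apply_generic_links(const char *text,
                                    const struct generic_link *rules, size_t nrules)
    {
        for (size_t i = 0; i < nrules; i++) {
            for (const char *p = strstr(text, rules[i].pattern); p != NULL;
                 p = strstr(p + 1, rules[i].pattern)) {
                printf("link at offset %ld -> %s\n", (long)(p - text), rules[i].target);
            }
        }
    }

A rule such as { "Interval", "Interval (music)" } would only be safe within a suitable scope, which is exactly what local links provide.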
6.3 Overall editing distribution
Table 3 shows the categorisation of all edits in the 0.01% dataset, using Levenshtein for the string distance, for registered and unregistered users. Note that the edit categories are not mutually exclusive, and thus will not sum to the total number of edits by that class of user. "Minor" is the category of edits which did not appear to change anything substantial: either the information extracted from the markup remains the same and the plaintext is very similar, or a registered user annotated the edit as minor. Notably, over 5% of edits are reverts, edits completely rolling back the previous edit; this implies that a further 5% of edits are being reverted (presumably because they are deemed unsuitable). (Actual figures may vary in either direction: this does not detect rollbacks to versions earlier than the immediately preceding version, and 'edit wars' of consecutive rollbacks will be entirely included in the first 5%, not the latter.) A substantial amount of effort is being expended merely keeping Wikipedia 'stationary'.

Table 3: Categorisation of edits for the 0.01% subset, Levenshtein
  Category         Registered   Unregistered    Total
  List of               1,146            453    1,599
  Revert                4,069            679    4,748
  Category              6,121            954    7,075
  URL link              5,548          2,977    8,525
  Indexing              7,174          1,397    8,571
  Template              7,992          1,330    9,322
  Content              10,275          4,182   14,457
  Minor                13,776          9,961   23,737
  Link only            20,969          7,877   28,846
  Page link            27,205          8,871   36,076
  Links                29,671         10,606   40,277
  Hyperstructure       38,358         11,701   50,059
  Total                57,463         23,733   81,196

Figure 2 demonstrates the distribution of users over the total number of edits they have made, in the vein of the Swartz study [10]. There is a sharp fall-off in the number of users as the number of edits increases (note the logarithmic scale on both axes): by far, most users only ever make very few edits, whether registered or not. Unsurprisingly, registered users tend to make more edits overall, and unregistered users are dominant at the scale of fewer than ten edits.

[Figure 2: User distribution over total number of edits made; 0.04% subset. Log-log plot of number of users against edit count (±5), for all, registered, and unregistered users.]

Figure 3 breaks the low-edit end of this distribution down by basic categories. It is interesting to note that, other than lines in close proximity (e.g. "content" and "page link"), the lines do not have any definitive overlaps: the breakdown of edits is consistent regardless of the number of edits the user has made. Users who have made 70 edits have made edits in the same relative proportions (i.e., more "revert" than "list of") as those who have made only five.

[Figure 3: User distribution over total number of edits made, by category; 0.04% subset. Number of users against edit count (±5) for the content, page link, URL link, template, category, revert, and 'list of' categories.]

Figure 4 shows how the magnitude of edits breaks down by the number of edits of that magnitude, again in the vein of Swartz [10]. Because this is clearly sensitive to the string distance algorithm, the 0.01% subset was used, with a focus on Levenshtein: the approximate distance for all users is shown as a sparsely dotted line with a consistent overestimate. These results are largely unsurprising: registered users make larger edits, and most edits are small, with the count rapidly falling off as the magnitude increases.

[Figure 4: Edit distribution over magnitude of edit; 0.01% subset. Number of edits against edit magnitude (±50), for all, registered, and unregistered users, plus the approximated distance for all users.]

6.4 Limitations of detection
There are, unfortunately, several kinds of 'overhead' cost which simply cannot be detected in a computationally feasible manner by this approach. For example, MediaWiki supports a feature called template 'substitution', which actually imports the template, with parameter substitution performed (with some caveats), into the source text of the including node. It is important to note that the relationship between the including and included nodes is lost, and that the benefits of re-use (such as storage efficiency and later corrections) are not available. The information regarding the origin of the text is also lost without manual documentation effort, including any parameters required for the more complicated templates. Because use of this feature is not semantically recorded by MediaWiki, it is largely indistinguishable from the addition of a paragraph of wikitext. As a result, it is not possible to evaluate the cost of maintaining or documenting these substitutions once the link to the original template has been lost.

It is also not computationally feasible to detect the pattern of a user performing the same fix on multiple pages, which would identify the cost of inadequate, or underused, transclusion. Transclusion is an inclusion-by-reference mechanism, where a selected (fragment of a) document is included 'live' into another, greatly facilitating re-use.

In Wikipedia, it is often desirable to accompany a link to a page with a short summary of that page's topic. In particular, Wikipedia has many cases where articles include a summary of another article, along with a "main article" link. The 'London' page (http://en.wikipedia.org/w/index.php?title=London&oldid=155695080), for example, has many sections which consist primarily of summaries of more detailed pages, such as 'Education in London'. However, without some form of transclusion or composition to share text, if the main article's summary changes, possibly because its subject changes, this change must be replicated manually to any page which also summarises it. A transclusion mechanism would allow a single summary of the subject to be shared by all pages which reference it, including the main article on the subject, if desired.

For example, the 'Education in London' page may begin with a summary of its topic, highlighting the most notable institutions and successful research areas. The article on 'London' may then, within its 'Education' section, transclude this summary from the 'Education in London' page. Should the summary be updated, perhaps because a university gains significant notability in a new research area, this change would be automatically reflected in the 'London' page, as it uses the same text.

While MediaWiki's templates do function as transclusion, they are not employed for this role: common usage and development effort focus on their use as preprocessing macros.
7. CONCLUSIONS
The experiment consisted of the non-exclusive classification of edits made throughout the history of Wikipedia, a large and public wiki system. Classifications covered both "text editing" (assumed to be primarily maintaining the information content of Wikipedia: its encyclopædic articles) and "link editing" (maintaining the navigational structure of the content). The hypothesis, that link editing forms a substantial proportion of total editing effort, and one which may potentially be automated, was supported by the results. Twice as many edits changed links alone as made major changes to the article text. Edits which maintained manual indexes of pages constituted approximately a tenth of total edits.

We are continuing this work with a more detailed, small-scale experiment, to better understand the patterns of real-world wiki editing. It is being treated as a knowledge elicitation task, to gather information on the mental processes behind wiki editing: the tasks editors set themselves, and how their actions are used to achieve them. Our long-term goal is to continue this research through the development and evaluation of a prototype system, informed by these studies, which can be used to test the hypothesis that increased hypermedia features actually result in benefits such as a decrease in editing overhead.

8. REFERENCES
[1] T. Berners-Lee, R. Cailliau, J.-F. Groff, and B. Pollermann. World-Wide Web: The Information Universe. Electronic Networking: Research, Applications and Policy, 1(2):74-82, 1992.
[2] V. Bush. As We May Think. The Atlantic Monthly, 176:101-108, Jul 1945.
[3] H. Davis, W. Hall, I. Heath, G. Hill, and R. Wilkins. Towards an integrated information environment with open hypermedia systems. In ECHT '92: Proceedings of the ACM Conference on Hypertext, pages 181-190, New York, NY, USA, 1992. ACM Press.
[4] A. M. Fountain, W. Hall, I. Heath, and H. Davis. MICROCOSM: An open model for hypermedia with dynamic linking. In European Conference on Hypertext, pages 298-311, 1990.
[5] F. Halasz and M. Schwartz. The Dexter hypertext reference model. Communications of the ACM, 37(2):30-39, 1994.
[6] M. Krötzsch, D. Vrandečić, and M. Völkel. Wikipedia and the Semantic Web: the missing links. In Proceedings of WikiMania 2005, 2005. Online at http://www.aifb.uni-karlsruhe.de/WBS/mak/pub/wikimania.pdf.
[7] V. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707, Feb 1966.
[8] F. Manola and E. Miller. RDF Primer. Technical report, W3C, Feb 2004.
[9] T. Nelson. Literary Machines. Mindful Press, Sausalito, California, 93.1 edition, 1993.
[10] A. Swartz. Who Writes Wikipedia? Online at http://www.aaronsw.com/weblog/whowriteswikipedia, Sep 2006.