Hyperstructure Maintenance Costs in Large-scale Wikis

Philip Boulain (prb@ecs.soton.ac.uk), Nigel Shadbolt (nrs@ecs.soton.ac.uk), Nicholas Gibbins (nmg@ecs.soton.ac.uk)
Intelligence, Agents, Multimedia Group
School of Electronics and Computer Science, University of Southampton
Southampton SO17 1BJ, United Kingdom

ABSTRACT
Wiki systems have developed over the past years as lightweight, community-editable, web-based hypertext systems. With the emergence of Semantic Wikis, these collections of interlinked documents have also gained a dual role as ad-hoc RDF [8] graphs. However, their roots lie in the limited hypertext capabilities of the World Wide Web [1]: embedded links, without support for composite objects or transclusion. In this paper, we present experimental evidence that hyperstructure changes, as opposed to content changes, form a substantial proportion of editing effort on a large-scale wiki. The experiment is set in the wider context of a study of how the technologies developed during decades of hypertext research may be applied to improve the management of wiki document structure and, with semantic wikis, knowledge structure.

Categories and Subject Descriptors
H.3.5 [Information Systems]: Information Storage and Retrieval: Online Information Services; H.5.3 [Information Systems]: Information Interfaces and Presentation: Group and Organization Interfaces; H.5.4 [Information Systems]: Information Interfaces and Presentation: Hypertext/Hypermedia

General Terms
Experimentation

Keywords
Hypertext, Semantic Web, Wiki, Web Science

Copyright is held by the authors. Copyright transferred for publishing online and on a conference CD-ROM. SWKM 2008: Workshop on Social Web and Knowledge Management @ WWW 2008, April 22, 2008, Beijing, China.
1. INTRODUCTION
This experiment forms part of a broader project looking into the potentially beneficial relationships between open hypermedia, the study of interconnected documents; the Semantic Web, the study of interconnectable data; and 'wikis', web-based communal editing systems.

Hypermedia is a long-standing field of research into the ways in which documents can expand beyond the limitations of paper, generally in terms of greater cross-referencing and composition (reuse) capability. Bush's As We May Think [2] introduces the hypothetical early hypertext machine, the 'memex', and defines its "essential feature" as "the process of tying two items together". This linking between documents is the common feature of hypertext systems, upon which other improvements are built.

As well as simple binary (two-endpoint) links, hypertext systems have been developed with features including n-ary links (multiple documents linked to multiple other documents), typed links (links which indicate something about why or how documents are related), generic links (links whose endpoints are determined by matching criteria of the document content, such as particular words), and composite documents, which are formed by combining a set of other, linked, documents. Open hypermedia extends this with interoperation, both with other hypermedia systems and users, and with non-hypermedia resources. A key concept in open hypermedia is the non-embedded link: links (and anchors) which are held external to the documents they connect. These allow links to be made to immutable documents, and to be added and removed in sets, often termed 'linkbases'. One of the earliest projects attempting to implement globally-distributed hypertext was Xanadu [9], a distinctive feature of whose design was transclusion: including (sections of) a document into another by reference.

In related work, we are currently investigating the relationship between an exemplar semantic wiki, Semantic MediaWiki [6], and open hypermedia systems as defined by the Dexter Hypertext Reference Model [5]. Our preliminary results, based on a formal description of Semantic MediaWiki in terms of the Dexter model, suggest that such semantic wikis can be treated as simple open hypermedia systems. While the details are beyond the scope of this paper, some basic parallels are evident: a wiki node is akin to both a hypermedia document and a Semantic Web resource. Semantic wikis generally treat typed inter-node links as RDF statements relating the nodes, and these links are embedded and binary in hypermedia terms. From this we can see a meaningful similarity between a graph of documents connected by typed links and a graph of resources connected by RDF statements. We can also see that wikis lack the more advanced hypermedia link features, such as links which are not embedded, or which have more than two endpoints. This suggests that semantic wikis stand to gain from techniques developed within hypermedia, but we must first judge whether there is any substantial cost to be reduced. Hence we have performed a quantitative experiment on a large-scale public wiki system to measure the proportion of effort expended on hyperstructure-related activities, as opposed to editing the document content.

2. HYPOTHESIS
We carried out an experiment to estimate the proportion of effort expended maintaining the infrastructure around data, rather than the data itself, on a weak hypertext wiki system. We define a 'weak' hypertext system here as one whose feature set is limited to embedded, unidirectional, binary links, as with the World Wide Web. Our hypothesis is that the manual editing of link structure, of a type which richer hypertext features could automate, will prove to be a significant overhead compared with changes to the text content.

This experiment also seeks to partially recreate a related, informal experiment discussed in an essay by Swartz [10].
3. DATASET
We chose English Wikipedia (http://en.wikipedia.org/) as the experimental dataset, because it has both a considerably large and varied set of documents, and a complete history of the editing processes, performed by a wide range of Web users, between their first and current versions (MediaWiki, unlike many wikis, never deletes old revisions of a page). The wiki community keep the dataset fairly well inter-linked and categorised for cross-reference, but they do this via the cumulative efforts of a large body of part-time editors. As well as being statistically significant, demonstrating possible improvement of English Wikipedia is socially significant, as it is a widely-used and active resource.

It is important to stress the size of the English Wikipedia dataset. Wikipedia makes available 'dumps' of its database in an ad-hoc XML format; because this study is interested in the progression of page contents across revisions, it was necessary to use the largest of these dumps, containing both page full-text and history (and, unfortunately, also non-encyclopædic pages, such as discussions and user pages). This dump is provided compressed using the highly space-efficient (although time-complex) bzip2 algorithm; even then, it is 84.6GB. The total size of the uncompressed XML file is estimated to be in the region of two terabytes.

4. PROCEDURE
Figure 1 shows the simplified data flow of the processing of the dump performed for the experiment.

[Figure 1: Data flow of the Wikipedia experiment. The downloaded dump is trimmed to articles and reduced to a random subset (with bzcat, zcat, and gzip handling decompression and recompression between the on-disk stages); each revision then passes through wiki-markup parsing and nearby-revision comparison, is tested for abuse, content, template, category, and link changes, and the resulting categorisation is aggregated into statistics.]

First, we trimmed the dataset down to just those pages which are encyclopædic articles, as these are the pages of greatest significance to the Wikipedia project's goals, and thus the most important to study. Otherwise, the dataset would include a lot of 'noise' in the form of discussion and user pages, which are likely to have different editing patterns and to be less connected to the hyperstructure. The most practical way to do this was to remove any page placed in a namespace. On English Wikipedia, this also has the effect of removing other page types, such as media and image descriptions, help pages copied from MetaWiki, front-page portal components, and templates. As this stage also required decompressing the data, it ran over the course of several days on a multi-processor server.

We took a random subset of the data for processing. Samples of 0.04% and 0.01% of pages (approximately: see the description of the subset tool below; actual page counts 14,215 and 3,589 respectively) were selected, yielding a compressed dataset which would fit on a CD-ROM and could be processed in a reasonable timeframe. Further iterations of the experiment may study larger subsets of the data.

We categorised the revisions into several edit types which could be automatically distinguished. In particular, a simple equality comparison between a revision and the revision two edits previous can detect the most common (anti-)abuse modification: the rollback, or revert (unfortunately, MediaWiki does not record such operations semantically). A sequence of reverts is usually indicative of an 'edit war', where two users continually undo each other's changes in favour of their own (e.g. http://en.wikipedia.org/w/index.php?title=Anarchism&diff=next&oldid=320139). Page blanking was also easy to detect, but identifying more complicated forms of vandalism (e.g. misinformation, spam) was not feasible: if reliable, automatic detection were possible, such edits would not be present in the data, as Wikipedia could prevent them from being applied. Identifying abuse (and abuse management) of the simpler types is important, as otherwise these edits would appear as very large changes.
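A minimal sketch of this check, assuming the revision texts of a page are available in order as strings (struct revision and flag_reverts are illustrative names; the actual categorise tool works as a stream over the dump XML):

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Illustrative revision record: only the plain text matters here. */
    struct revision {
        const char *text;
    };

    /* Mark which revisions in a page's history are reverts, by comparing each
     * revision with the one two edits previous. MediaWiki does not record
     * rollbacks semantically, so textual equality is the only signal. A run
     * of set flags suggests an 'edit war'; an empty text is a page blanking. */
    static void flag_reverts(const struct revision *revs, size_t n, bool *revert)
    {
        for (size_t i = 0; i < n; i++)
            revert[i] = i >= 2 && strcmp(revs[i].text, revs[i - 2].text) == 0;
    }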
In order to detect changes in the text content, templates used, MediaWiki categories, and links from a page, it was necessary to attempt to parse the MediaWiki markup format. This 'wikitext', as it is known, is not a formally defined language: there is no grammar for it, and it does not appear likely that an unambiguous grammar actually exists. MediaWiki does not have a parser in the sense used by processing tools such as compilers and XML libraries; instead it has a long and complicated set of text substitution procedures which convert parts of wikitext into display-oriented HTML. These substitutions often interact in an ill-defined manner, generally resulting either in more special-case substitutions, or in the interaction being defined as a new, hybrid feature, which editors then use. Because of these problems, and the lack of abstraction in MediaWiki's 'parser', as much as the programming language boundary, we created a 'scraping' parser which attempts to approximate partial processing of the wikitext format and return mostly correct results. This parser is a single-pass state machine (42 states) with a few additional side-effects. This yields excellent performance: testing showed that the time spent parsing is dominated by the time spent on decompression.
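For illustration of the scraping style of parsing (this is not the actual 42-state machine), the following sketch extracts internal page-link targets of the form [[Target]] or [[Target|label]]; categories ([[Category:...]]) and templates ({{...}}) are handled by the real parser with further states. Function names and buffer sizes are illustrative:

    #include <stdio.h>

    enum state { TEXT, OPEN1, IN_LINK, IN_TAIL, CLOSE1 };

    /* Toy single-pass scanner for [[...]] page links in wikitext. */
    static void scan_page_links(const char *wikitext)
    {
        enum state st = TEXT;
        char target[256];
        size_t len = 0;

        for (const char *p = wikitext; *p != '\0'; p++) {
            switch (st) {
            case TEXT:
                st = (*p == '[') ? OPEN1 : TEXT;
                break;
            case OPEN1:                      /* saw one '[' */
                if (*p == '[') { st = IN_LINK; len = 0; }
                else st = TEXT;
                break;
            case IN_LINK:                    /* collecting the link target */
                if (*p == ']') {
                    target[len] = '\0';
                    printf("page link: %s\n", target);
                    st = CLOSE1;
                } else if (*p == '|') {      /* display label follows; target complete */
                    target[len] = '\0';
                    printf("page link: %s\n", target);
                    st = IN_TAIL;
                } else if (len < sizeof target - 1) {
                    target[len++] = *p;
                }
                break;
            case IN_TAIL:                    /* skipping the display label */
                if (*p == ']') st = CLOSE1;
                break;
            case CLOSE1:                     /* saw one ']'; expect another */
                st = (*p == ']') ? TEXT : IN_TAIL;
                break;
            }
        }
    }

Comparing the sets of targets extracted from two adjacent revisions then gives the link-change categorisation described below.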
To determine whether an edit included a significant ('major') change to the text content, we required a difference metric between the plaintext of the revisions. This metric was then compared to a threshold to classify edits as content changes or not (in particular, the imperfect parser generates 'noise' from some non-content changes, as it cannot correctly remove all of the markup). The default threshold was chosen as 5%: sentences in the English language are generally around twenty words in length, so this considers anything up to changing one word in each sentence as non-major (minor). MediaWiki also allows registered users to explicitly state that an edit is minor; this flag was respected where present.

We chose an approximation of Levenshtein distance [7], as it is a simple measure of insertions, deletions, and substitutions, fitting the kind of edit operations performed on the wiki. However, computing the Levenshtein distance itself was far too time-complex, even with aggressive optimisation, taking two minutes on a tiny test set of just a few thousand revisions of a single page (before trimming away the identical parts at either end of both strings to take advantage of edit locality, this took 45 minutes). The problem is that the matrix-based approach is O(n × m), where n and m are the string lengths, in all cases: for n and m in the region of 16,000 characters, as found in many revisions, merely iterating through all 256 million matrix cells was prohibitively expensive.
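A sketch of the classification step, assuming some string distance function is supplied, and assuming the 5% threshold is taken relative to the length of the longer revision (the normalisation is not spelled out above); the prefix/suffix trimming just mentioned is included, and all names are illustrative:

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Strip the common prefix and suffix of two revisions before measuring the
     * distance; edits are usually local, so this shrinks the problem greatly. */
    static void trim_common(const char **a, size_t *alen, const char **b, size_t *blen)
    {
        while (*alen > 0 && *blen > 0 && (*a)[0] == (*b)[0]) {
            (*a)++; (*b)++; (*alen)--; (*blen)--;
        }
        while (*alen > 0 && *blen > 0 && (*a)[*alen - 1] == (*b)[*blen - 1]) {
            (*alen)--; (*blen)--;
        }
    }

    /* Classify an edit as 'major' if the plaintext distance exceeds 5% of the
     * longer revision's length, unless the editor flagged it as minor. */
    static bool is_major(const char *oldtext, const char *newtext, bool flagged_minor,
                         size_t (*distance)(const char *, size_t, const char *, size_t))
    {
        size_t alen = strlen(oldtext), blen = strlen(newtext);
        size_t longer = alen > blen ? alen : blen;

        if (flagged_minor || longer == 0)
            return false;
        trim_common(&oldtext, &alen, &newtext, &blen);
        return distance(oldtext, alen, newtext, blen) * 100 > longer * 5;
    }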
Instead of the full Levenshtein computation, we therefore developed a new approach to computing such a distance, taking advantage of the domain-specific knowledge that the two strings being compared are likely to be very similar save for 'local' edits: the difference is likely to be a new paragraph, a removed sentence, or some changed punctuation. Rather than searching efficiently within the space of editing operations, as Levenshtein does, it is based on the idea of 'sliding windows': a pass is made over both strings in parallel; when characters begin to differ, a look-back 'window' is opened at the point at which the differences began, and widens until similarity is again found between the two windows. At this point, the position through the strings resynchronises, the distance is increased by the offset required, and the windows are again 'closed'. When the end of either string is reached by the far edge of the window, the algorithm can terminate, as any remaining characters in the other string must be unmatched and thus add to the distance. As a result, the algorithm scales with regard to the shorter of the two strings, which is helpful when revisions may add whole paragraphs of new text at the end. To reduce inaccuracy in certain cases, the algorithm maintains a 'processed point' cursor, to avoid double-counting of overlapping insertions and deletions. Pseudocode is presented as Algorithm 1, which works on a pair of string buffers; str.c in the tool source contains a C implementation. This approach is still O(n × m) in the worst case, but is O(n) (where n is the shorter string) for identical strings, and degrades smoothly as contiguous differences increase in size: instead of two minutes, the tiny test set was compared in a little over ten seconds.

Algorithm 1: 'Sliding window' string distance metric

    procedure String-Distance(A, B)
        proc ← 0                                ▷ number of characters of string processed
        procstr ← neither                       ▷ last string aligned upon
        dist ← 0                                ▷ difference accumulator
        nearA ← farA ← A; nearB ← farB ← B      ▷ near and far window pointers
        let endA be the beyond-last character of buffer A, and endB beyond B

        procedure Scan(near, far)
            for scan ← near to the character before far do
                if the characters at scan and far are the same then
                    return scan
            return false

        repeat
            synfarA ← Scan(nearA, farA)
            synfarB ← Scan(nearB, farB)
            if synfarA and synfarB then         ▷ both windows found a missed alignment
                if synfarA is further into A than synfarB is into B then
                    farA ← synfarA
                else
                    farB ← synfarB
            else if synfarA then
                farA ← synfarA
            else if synfarB then
                farB ← synfarB
            if the characters at farA and farB are the same then
                                                ▷ aligned; calculate nears after the processed point
                enA ← Min(nearA, A + proc − 1)
                enB ← Min(nearB, B + proc − 1)
                unA ← positive distance from enA to farA    ▷ unaligned lengths
                unB ← positive distance from enB to farB
                procedure Align(un, far, buffer, other)
                    dist ← dist + un
                    proc ← far's distance into buffer
                    if procstr = other then
                        proc ← proc + 1
                    procstr ← buffer
                if unA > unB then
                    Align(unA, farA, A, B)
                else
                    Align(unB, farB, B, A)
                if farA = endA then             ▷ ending
                    dist ← dist + the distance between farB and endB
                else if farB = endB then
                    dist ← dist + the distance between farA and endA
                else                            ▷ advanced with a closed window
                    nearA ← farA ← farA + 1
                    nearB ← farB ← farB + 1
                    proc ← proc + 1
            else                                ▷ not aligned; widen the windows
                if farA ≠ endA then
                    farA ← farA + 1
                if farB ≠ endB then
                    farB ← farB + 1
        until farA = endA or farB = endB
        return dist

Unfortunately, changes such as 'ABCF' to 'ADCDBCF' can return overestimates, as the localisation which explicitly prevents full lookback (and keeps the computational cost below O(n²)) causes the 'C' in 'BCF' to match with the 'C' in 'DCD': 'ADC' is considered a substitution of 'ABC' before the algorithm can realise that 'BC' is still intact in the string and 'DCD' is merely an insertion. As a result, the later 'B' is considered an insertion, as it no longer matches anything, and the distance is overestimated by one. Synthetic tests showed this overestimation to be minor; tests against Levenshtein on a tiny subset of Wikipedia data (a node's first few hundred revisions, thus under heavy editing) showed it to be larger, with errors in the tens and a peak error of over two hundred. The reason for such large errors is unclear, as the resynchronisation approach should also keep the error localised, but it does not greatly affect the result for the purpose of minor/major determination: the majority of changes were correctly classified.
Identifying changes to links, categories, and templates was significantly simpler, and merely required comparing the sets of links identified by the parser across two revisions. These categorisations yielded simple information on which kinds of changes were made by each revision, and removed much of the 'bulk' of the dataset (the revision texts); as a result, simple scripts could then handle the data to aggregate it into various groupings in memory, so as to produce graph data and statistics for analysis. Gnuplot (http://www.gnuplot.info/) was used to plot the graph data into graphics as part of the build process for this paper.

We identified the following non-mutually-exclusive groupings to usefully categorise edits (a sketch of one possible in-memory representation follows the list):

Revert: Edit which simply undoes a previous edit.
Content: Major (nontrivial) edit of the page content.
Minor: Minor (trivial) edit of the page content.
Category: Edit to the categories of a page.
List of: Edit to a page which is an index to other pages.
Indexing: Edit to categories or listings, possibly both.
Template: Edit to the templates used by a page.
Page link: Edit to an internal page link.
URL link: Edit to a WWW URL link; usually external.
Links: Edit to page or URL links.
Link only: As 'Links', but excluding major edits.
Hyperstructure: Any hypermedia change: indexing, linking, or templates.

We expand upon the definition and significance of these groups as needed in section 6.
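Since the groupings are not mutually exclusive, a revision can carry several at once. A minimal sketch of one possible representation as bit flags (flag names are illustrative; the derived groups are simply unions of the primitive ones):

    /* Primitive, automatically detected edit properties. */
    enum edit_flags {
        EDIT_REVERT    = 1 << 0,
        EDIT_CONTENT   = 1 << 1,   /* 'major' text change */
        EDIT_MINOR     = 1 << 2,
        EDIT_CATEGORY  = 1 << 3,
        EDIT_LIST_OF   = 1 << 4,
        EDIT_TEMPLATE  = 1 << 5,
        EDIT_PAGE_LINK = 1 << 6,
        EDIT_URL_LINK  = 1 << 7,
    };

    /* Derived groupings are unions of the primitive flags. */
    #define EDIT_INDEXING       (EDIT_CATEGORY | EDIT_LIST_OF)
    #define EDIT_LINKS          (EDIT_PAGE_LINK | EDIT_URL_LINK)
    #define EDIT_HYPERSTRUCTURE (EDIT_INDEXING | EDIT_LINKS | EDIT_TEMPLATE)

    /* 'Link only': link edits which are not also major content edits. */
    static int is_link_only(unsigned flags)
    {
        return (flags & EDIT_LINKS) && !(flags & EDIT_CONTENT);
    }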
5. TOOLS DEVELOPED
To process the sizable dataset, we created a set of small, robust, stream-based tools in C. Stream-based processing was a necessity, as manipulating the entire dataset in memory at once was simply infeasible; instead, the tools are intended to be combined arbitrarily using pipes. We used standard compression tools to decompress and recompress the data for storage on disk (specifically, GNU Zip for the intermediate files: bzip2, as originally used by Wikipedia, made processing heavily CPU-bound), as otherwise the verbosity of the XML format caused processing to be heavily I/O-bound. The open source libxml2 library (http://xmlsoft.org/) was used to parse and regenerate the XML via its SAX interface. A selection of the more notable tools:

dumptitles: Converts a MediaWiki XML dump (henceforth, "MWXML") into a plain, newline-separated list of page titles. Useful for diagnostics, e.g. confirming that the random subset contains an appropriate range of pages.

discardnonart: Reads in MWXML, and outputs MWXML sans any pages which are in a namespace; pedantically, due to the poor semantics of MWXML, those with colons in the title. This implements the "trim to articles" step of figure 1.

randomsubset: Reads and writes MWXML, preserving a random subset of the input pages. In order for this to be O(1) in memory consumption, it does not strictly provide a given proportion of the input; instead, the control is the probability of including a given page in the output. As a result, asking for 50% of the input may actually yield anywhere between none and all of the pages: it is just far more likely that the output will be around 50% of the input. (A better algorithm, which is O(1) with regard to total data size but O(n) with regard to subset size, is to store a buffer of up to n pages and probabilistically replace them with different pages as they are encountered. However, even this would be prohibitively memory-intensive at statistically significant subset sizes, as each page may have thousands of revisions, each with thousands of bytes of text, all of which must be copied into the buffer.) A sketch of the selection decision appears after this list.

categorise: Reads MWXML and categorises the revisions, outputting the results to a simple XML format.

cataggr: A Perl script which processes the categorisation XML to produce the final statistical results and graph data. By this point, the data are small enough that a SAX parser is used to build a custom in-memory document tree, such that manipulation is easier.

The tools are available under the open source MIT license, and can be retrieved from http://users.ecs.soton.ac.uk/prb/phd/wikipedia/ to recreate the experiment.
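A minimal sketch of the per-page decision made by randomsubset, assuming a caller that streams pages one at a time (the probability parameter and the use of the C library rand are illustrative; the real tool operates directly on MWXML events):

    #include <stdbool.h>
    #include <stdlib.h>

    /* Decide, independently for each page, whether to copy it to the output.
     * Memory use is O(1) because no page is ever buffered; the expected output
     * is a fraction p of the input, but the exact proportion is not guaranteed. */
    static bool include_page(double p)
    {
        return rand() < p * ((double)RAND_MAX + 1.0);
    }

In use, the caller would seed the generator once (e.g. with srand) and then forward or discard each page's XML events according to this decision.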
6. RESULTS
Because of the known error margin of the approximation of Levenshtein distance, we computed results from both genuine and approximated distances on the 0.01% subset, so as to discover and illustrate the effects of the approximation; the computational cost difference between the algorithms was significant: two and a half hours for the genuine distance, eight minutes for the approximated one. Results were then generated from the more statistically significant 0.04% subset (27 hours). This latter subset contained some pages on contentious topics, which had seen large numbers of revisions as a result.

6.1 Index management
Table 1 shows the proportions of edits in the categories pertaining to index management. "Categories" are changes to the categories in which a page was placed. "Lists" are any change to a 'List of' page; these pages serve as manually-maintained indices to other pages. "Overhead" are changes which fall into either of these categories: because they are not mutually exclusive (lists may be categorised), it is not the sum of the other two values. Because these metrics do not consider the 'content' magnitude of a change, they are unaffected by the choice of string distance algorithm.

Table 1: Proportions of edits related to index management
  Edit type      (a) 0.01% subset   (b) 0.04% subset
  Categories          8.71%              8.75%
  Lists               1.97%              3.72%
  Overhead           10.56%             12.34%

The ten percent overhead shows a strong case for the need for stronger semantics and querying on Wikipedia; this is one of the key goals, and expected benefits, of the Semantic MediaWiki project. While virtually every 'List of' node could be replaced with a query on appropriate attributes, the gain in category efficiency is harder to measure. Any semantic wiki must still be provided with categorisation metadata such that the type of pages can be used to answer such queries. However, some improvement is to be expected, as there are current Wikipedia categories which could be inferred: either because they are a union of other categories (e.g. 'Free software' and 'Operating systems' cover the existing category 'Free software operating systems') or because they are implied by a more specialised category, and so no longer need to be explicitly applied to a page.

The increase in list overhead seen in the larger subset is likely a result of having a more representative proportion of 'List of' pages. Otherwise, the results are largely consistent across sample sizes.

6.2 Link management
Table 2 shows the categories related to the management of links. "Links" refers to edits which changed either page-to-page or page-to-URL links. "Links only" refers to such edits excluding those which also constituted a 'major' content change: they are edits concerned only with links and other structure. "Hyperstructure" is the category of edits which changed any of the navigational capabilities of the wiki: categories, 'List of' pages, links, or templates. "Content" is simply the category of 'major' edits.

Table 2: Proportions of edits related to link management
  Edit type           (a) 0.01% subset,  (b) 0.01% subset,  (c) 0.04% subset,
                          Levenshtein        approximated       approximated
  Links                    49.60%             49.60%             49.56%
  Links only               35.53%             23.36%             25.24%
  Hyperstructure           61.65%             61.65%             61.90%
  Content                  17.81%             35.60%             35.99%

  Ratio over content
  Links                     2.79               1.39               1.38
  Links only                2.00               0.71               0.70
  Hyperstructure            3.46               1.73               1.72

The overestimating effect of the approximate string distance algorithm can be seen as a greater proportion of edits being considered 'major', with a knock-on effect of reducing the ratios of link edits over content edits. However, the results are consistent between the 0.01% subset with the approximated string distance and the sample set four times its size. As a result, it would appear that the smaller sample has not introduced significant error in this case, and it is reasonable to assume that a Levenshtein distance comparison of the larger dataset would yield results similar to those of the 0.01% subset. Therefore, further discussion will focus on the Levenshtein distance results for the 0.01% subset.

These figures show the significance of hyperstructure to Wikipedia, to a surprising degree. While we expected that link editing would prove a substantial proportion of edits compared to content, we did not anticipate that twice as many edits would change links alone as change content. Most link changes were page links (those to other pages on the wiki, or a metawiki) as opposed to URL links to arbitrary webpages (in some cases, pages on the wiki with special arguments): 36,076 edits modified the former, but only 8,525 the latter.

With such a proportion of editing effort being expended on modifying links on Wikipedia, there is a clear need to improve this process. Introducing richer hypermedia features to wikis, such as generic links, should prove one possible improvement. Generic links are links whose endpoints are defined by matching on criteria of the document content, a basic example being matching on a particular substring. A generic link can specify that a page's title should link to that page, rather than requiring users to annotate it manually: some early wiki systems offered this capability, but only for page titles which were written in the unnatural 'CamelCase' capitalisation. Advanced examples, such as local links, present in Microcosm [3, 4], can specify scope limits on the matching. This would help with ambiguous terms on Wikipedia, such as 'Interval', which should be linked to a specific meaning, such as 'Interval (music)'.
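As a rough sketch of the concept, rather than of Microcosm's or any particular wiki's implementation, a generic link can be viewed as a rule mapping a matched substring to a target page; applying a rule set to plain text yields link anchors without any manual annotation (names here are illustrative):

    #include <stdio.h>
    #include <string.h>

    /* A generic link: any occurrence of 'pattern' in a document becomes a link
     * to 'target'. Scope limits (as with Microcosm's local links) would be
     * additional fields restricting where a rule may apply. */
    struct generic_link {
        const char *pattern;
        const char *target;
    };

    static void apply_generic_links(const char *text,
                                    const struct generic_link *rules, size_t nrules)
    {
        for (size_t i = 0; i < nrules; i++) {
            for (const char *p = strstr(text, rules[i].pattern); p != NULL;
                 p = strstr(p + 1, rules[i].pattern)) {
                printf("link at offset %ld -> %s\n", (long)(p - text), rules[i].target);
            }
        }
    }

A rule such as { "Interval", "Interval (music)" } would only be safe within a suitable scope, which is exactly what local links provide.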
6.3 Overall editing distribution
Table 3 shows the categorisation of all edits in the 0.01% dataset, using Levenshtein for the string distance, for registered and unregistered users. Note that the edit categories are not mutually exclusive, and thus will not sum to the total number of edits by that class of user. "Minor" is the category of edits which did not appear to change anything substantial: either the information extracted from the markup remains the same and the plaintext is very similar, or a registered user annotated the edit as minor. Notably, over 5% of edits are reverts, edits completely rolling back the previous edit; this implies that a further 5% of edits are being reverted (presumably because they are deemed unsuitable). (Actual figures may vary in either direction: this does not detect rollbacks to versions earlier than the immediately preceding version, and 'edit wars' of consecutive rollbacks will be entirely included in the first 5%, not the latter.) A substantial amount of effort is being expended merely keeping Wikipedia 'stationary'.

Table 3: Categorisation of edits for the 0.01% subset, Levenshtein
  Category         Registered   Unregistered    Total
  List of               1,146            453    1,599
  Revert                4,069            679    4,748
  Category              6,121            954    7,075
  URL link              5,548          2,977    8,525
  Indexing              7,174          1,397    8,571
  Template              7,992          1,330    9,322
  Content              10,275          4,182   14,457
  Minor                13,776          9,961   23,737
  Link only            20,969          7,877   28,846
  Page link            27,205          8,871   36,076
  Links                29,671         10,606   40,277
  Hyperstructure       38,358         11,701   50,059
  Total                57,463         23,733   81,196

Figure 2 demonstrates the distribution of users over the total number of edits they have made, in the vein of the Swartz study [10]. There is a sharp fall-off in the number of users as the number of edits increases (note the logarithmic scale on both axes): by far, most users only ever make very few edits, whether registered or not. Unsurprisingly, registered users tend to make more edits overall, and unregistered users are dominant at the scale of fewer than ten edits.

[Figure 2: User distribution over total number of edits made; 0.04% subset. Log-log plot of number of users against edit count (±5), for all, registered, and unregistered users.]

Figure 3 breaks the low-edit end of this distribution down by basic categories. It is interesting to note that, other than lines in close proximity (e.g. "content" and "page link"), the lines do not have any definitive overlaps: the breakdown of edits is consistent regardless of the number of edits the user has made. Users who have made 70 edits have made edits in the same relative proportions (i.e., more "revert" than "list of") as those who have made only five.

[Figure 3: User distribution over total number of edits made, by category; 0.04% subset. Number of users against edit count (±5) for the content, page link, URL link, template, category, revert, and 'list of' categories.]

Figure 4 shows how the magnitude of edits breaks down by the number of edits of that magnitude, again in the vein of Swartz [10]. Because this is clearly sensitive to the string distance algorithm, the 0.01% subset was used, with a focus on Levenshtein: the approximate distance for all users is shown as a sparsely dotted line with a consistent overestimate. These results are largely unsurprising: registered users make larger edits, and most edits are small, with the count rapidly falling off as the magnitude increases.

[Figure 4: Edit distribution over magnitude of edit; 0.01% subset. Number of edits against edit magnitude (±50), for all, registered, and unregistered users, plus the approximated distance for all users.]

6.4 Limitations of detection
There are, unfortunately, several kinds of 'overhead' cost which simply cannot be detected in a computationally feasible manner by this approach. For example, MediaWiki supports a feature called template 'substitution', which actually imports the template, with parameter substitution performed (with some caveats), into the source text of the including node. It is important to note that the relationship between the including and included nodes is lost, and that the benefits of re-use (such as storage efficiency and later corrections) are not available. The information regarding the origin of the text is also lost without manual documentation effort, including any parameters required for the more complicated templates. Because use of this feature is not semantically recorded by MediaWiki, it is largely indistinguishable from the addition of a paragraph of wikitext. As a result, it is not possible to evaluate the cost of maintaining or documenting these substitutions once the link to the original template has been lost.

It is also not computationally feasible to detect the pattern of a user performing the same fix on multiple pages, which would identify the cost of inadequate, or underused, transclusion. Transclusion is an inclusion-by-reference mechanism, where a selected (fragment of a) document is included 'live' into another, greatly facilitating re-use.

In Wikipedia, it is often desirable to accompany a link to a page with a short summary of that page's topic. In particular, Wikipedia has many cases where articles include a summary of another article, along with a "main article" link. The 'London' page (http://en.wikipedia.org/w/index.php?title=London&oldid=155695080), for example, has many sections which consist primarily of summaries of more detailed pages, such as 'Education in London'. However, without some form of transclusion or composition to share text, if the main article's summary changes, possibly because its subject changes, this change must be replicated manually to any page which also summarises it. A transclusion mechanism would allow a single summary of the subject to be shared by all pages which reference it, including the main article on the subject, if desired.

For example, the 'Education in London' page may begin with a summary of its topic, highlighting the most notable institutions and successful research areas. The article on 'London' may then, within its 'Education' section, transclude this summary from the 'Education in London' page. Should the summary be updated, perhaps because a university gains significant notability in a new research area, this change would be automatically reflected in the 'London' page, as it uses the same text.

While MediaWiki's templates do function as transclusion, they are not employed for this role: common usage and development effort focus on their use as preprocessing macros.
7. CONCLUSIONS
The experiment consisted of the non-exclusive classification of edits made throughout the history of Wikipedia, a large and public wiki system. Classifications covered both "text editing" (assumed to be primarily maintaining the information content of Wikipedia: its encyclopædic articles) and "link editing" (maintaining the navigational structure of the content). The hypothesis, that link editing forms a substantial proportion of total editing effort, and one which may potentially be automated, was supported by the results. Twice as many edits changed links alone as made major changes to the article text. Edits which maintained manual indexes of pages constituted approximately a tenth of total edits.

We are continuing this work with a more detailed, small-scale experiment, to better understand the patterns of real-world wiki editing. It is being treated as a knowledge elicitation task, to gather information on the mental processes behind wiki editing: the tasks editors set themselves, and how their actions are used to achieve them. Our long-term goal is to continue this research through the development and evaluation of a prototype system, informed by these studies, which can be used to test the hypothesis that increased hypermedia features actually result in benefits such as a decrease in editing overhead.

8. REFERENCES
[1] T. Berners-Lee, R. Cailliau, J.-F. Groff, and B. Pollermann. World-Wide Web: The Information Universe. Electronic Networking: Research, Applications and Policy, 1(2):74-82, 1992.
[2] V. Bush. As We May Think. The Atlantic Monthly, 176:101-108, Jul 1945.
[3] H. Davis, W. Hall, I. Heath, G. Hill, and R. Wilkins. Towards an integrated information environment with open hypermedia systems. In ECHT '92: Proceedings of the ACM Conference on Hypertext, pages 181-190, New York, NY, USA, 1992. ACM Press.
[4] A. M. Fountain, W. Hall, I. Heath, and H. Davis. MICROCOSM: An open model for hypermedia with dynamic linking. In European Conference on Hypertext, pages 298-311, 1990.
[5] F. Halasz and M. Schwartz. The Dexter hypertext reference model. Communications of the ACM, 37(2):30-39, 1994.
[6] M. Krötzsch, D. Vrandečić, and M. Völkel. Wikipedia and the Semantic Web: the missing links. In Proceedings of WikiMania 2005, 2005. Online at http://www.aifb.uni-karlsruhe.de/WBS/mak/pub/wikimania.pdf.
[7] V. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707, Feb 1966.
[8] F. Manola and E. Miller. RDF Primer. Technical report, W3C, Feb 2004.
[9] T. Nelson. Literary Machines. Mindful Press, Sausalito, California, 93.1 edition, 1993.
[10] A. Swartz. Who Writes Wikipedia? Online at http://www.aaronsw.com/weblog/whowriteswikipedia, Sep 2006.