=Paper= {{Paper |id=Vol-2929/paper4 |storemode=property |title=The Secret Life of Wikipedia Tables |pdfUrl=https://ceur-ws.org/Vol-2929/paper4.pdf |volume=Vol-2929 |authors=Tobias Bleifuß,Leon Bornemann,Dmitri V. Kalashnikov,Felix Naumann,Divesh Srivastava |dblpUrl=https://dblp.org/rec/conf/vldb/BleifussBKNS21 }} ==The Secret Life of Wikipedia Tables== https://ceur-ws.org/Vol-2929/paper4.pdf
The Secret Life of Wikipedia Tables

Tobias Bleifuß, Hasso Plattner Institute, University of Potsdam, Germany, tobias.bleifuss@hpi.de
Leon Bornemann, Hasso Plattner Institute, University of Potsdam, Germany, leon.bornemann@hpi.de
Dmitri V. Kalashnikov∗, dmitri.vk@acm.org
Felix Naumann, Hasso Plattner Institute, University of Potsdam, Germany, felix.naumann@hpi.de
Divesh Srivastava, AT&T Chief Data Office, United States, divesh@att.com

ABSTRACT
Tables on the web, such as those on Wikipedia, are not the static grid of values that they
seem to be. Rather, they have a life of their own: they are created under certain
circumstances and in certain webpage locations, they change their shape, they move, they
grow, they shrink, their data changes, they vanish, and they re-appear. When users look at
web tables or when scientists extract data from them, they are most likely not aware that
behind each table lies a rich history.
   For this empirical paper, we have extracted, matched and analyzed the entire history of
all 3.5 M tables on the English Wikipedia for a total of 53.8 M table versions. Based on
this enormous dataset of public table histories, we provide various analysis results, such
as statistics about lineage sizes, table positions, volatility, change intervals, schema
changes, and their editors. Apart from satisfying curiosity, analyzing and understanding
the change-behavior of web tables serves various use cases, such as identifying out-of-date
values, recognizing systematic changes across tables, and discovering change dependencies.

Figure 1: An example of an evolving table in Wikipedia
(from https://en.wikipedia.org/?diff=prev&oldid=541341520).

Reference Format:
Tobias Bleifuß, Leon Bornemann, Dmitri V. Kalashnikov, Felix Naumann, and Divesh
Srivastava. The Secret Life of Wikipedia Tables. In the 2nd Workshop on Search,
Exploration, and Analysis in Heterogeneous Datastores (SEA Data 2021).

∗ Work done while at AT&T Research.

Copyright © 2021 for the individual papers by the papers’ authors. Copyright © 2021 for the
volume as a collection by its editors. This volume and its papers are published under the
Creative Commons License Attribution 4.0 International (CC BY 4.0).
Published in the Proceedings of the 2nd Workshop on Search, Exploration, and Analysis in
Heterogeneous Datastores, co-located with VLDB 2021 (August 16-20, 2021, Copenhagen,
Denmark) on CEUR-WS.org.

1    EMPIRICAL RESEARCH ON THE WEB
Traditionally, empiricism plays a minor role in the theory- and engineering-oriented field
of our research community, while it has played a significant role in other disciplines of
computer science (e.g., [6, 7]). Hardly ever do we pause to analyze and reflect on the
observable, “natural” behavior of data and systems. Among the notable exceptions is the
area of data quality research [22].
   One example of observable behavior is that of web tables, in particular those that are
collaboratively edited. As such, an example of a very heterogeneous and semi-structured
data lake is the set of tables on Wikipedia. Such web tables are used for a variety of
purposes, as was recently surveyed in [10], including entity extraction and fact
generation [16], improving web search [21], and entity linking [19]. Other work seeks to
enhance web tables themselves, such as generating their title [14], generating column
headers [24], or finding subject columns [5, 25]. Again, all of these approaches make use
of table content, headers, and surrounding text and data. Providing more such data, and in
particular different versions of such data, gives these machine learning approaches a
richer input set.
   In the context of our Janus project [3], we have been extracting and working with the
histories of various structured datasets, including DBLP, IMDB, open government data, and
in particular Wikipedia, for which a detailed history of every edit is available. In this
empirical paper, we focus on tables as they appear on Wikipedia pages and report on our
various observations across their lifetime, including their creation (Section 4), their
evolution over time (Section 5), and ultimately their deletion (Section 6). We report on
such varied dimensions as table-counts, users, duration, edits, table similarity, table
position, and of course time itself, highlighting expected and some surprising behavior.
Figure 1 shows one exemplary evolutionary step (in schema and data) for one of millions of
Wikipedia tables.
   Our results can help researchers better understand the volatility of web tables: a given
table or corpus snapshot is not a stable basis but rather just that: a snapshot with a
history of changes leading up to it and a future with many further changes. In fact, at the
time of writing any given Wikipedia table was changed twice in the past year, on average,
but with a standard deviation of 9.1, and some tables changed multiple times per day. Not
only does the content of tables change; their schema also evolves over time. This
information about evolving schema can serve, for example, to identify synonymous
attributes [23].
   In the following, we highlight selected analyses in this paper and outline their
possible implications for researchers:

Need for timeliness. Figure 8 shows how quickly a snapshot becomes outdated. As a
     consequence, all models that are trained on static snapshots run the risk of quickly
     becoming obsolete. Efficient methods for updating these are therefore desirable.
Help with maintenance and updating. A large portion of tables is created and maintained
     by power-users, as can be seen in Figures 5 and 10. Knowledge about update patterns
     could be used to notify these editors about (potentially) outdated values and the
     need for updates.
Suggestions for cleaning and improvement. Figure 16 illustrates an example of the
     inconsistencies that can emerge in tables. Based on the knowledge of how similar
     tables evolve, one can make concrete suggestions to improve their (in this case)
     schemata, such as to rename or add certain columns.

2    RELATED WORK
We discuss two types of related work: structured datasets with history and Wikipedia
change analysis. There is a lot of research on web tables, so we can only provide a
high-level overview in this short paper and refer to surveys and research papers for more
details.

Related corpora. Wikipedia provides access to its entire version history, allowing us to
track very fine-grained changes. A variety of datasets that also deal with
(semi-)structured content have been extracted from the web and Wikipedia before. Multiple
corpora of web tables [12, 18] provide extracts of static versions of tables on the web
and have since been subject to extensive research [10]. For example, Lautert et al.
establish a taxonomy of web tables and thus give an insight into the general structure of
static web tables in [17], whereas we focus on the temporal evolution of web tables. The
infobox history dataset WHAD [2] comprises structured information on Wikipedia, namely the
changes of infoboxes. This dataset is orthogonal to our dataset, since it does not cover
general tables, which are more diverse and also more complex in comparison.

Analyzing changes in Wikipedia. The content of Wikipedia has been the subject of much
research [20]. While large parts of that research were conducted on static snapshots of
Wikipedia, a variety of works analyze changes on Wikipedia. Both the evolution of
content [11] and the evolution of the page link graph [8] have been studied. Specifically,
the study of content evolution can help detect conflicts [15] or controversy that may
result in edit-wars [9]. The edit histories serve as input to event-extraction [13] and
are also valuable for trust assignment [1]. Our approach can provide a better
understanding of what has really changed, from which many of these studies should benefit.

3    TABLE CORPUS
To explore tables over time, we need to be able to track tables over time, which is a
non-trivial task as tables and their context can change over time. We consider tables as
objects with an identity, which, in contrast to a table's concrete shape and content, stays
constant over time. A table can have multiple versions, where each version is an edit of
the previous version. However, tables on the web usually lack a stable identifier, which
is why we have to infer that identifier.
   We proposed a solution for this identity inference through a table matching procedure.
The details of this work are described in [4], where we also evaluate the matching and
show that it works much better than related work. Our input for the table change
extraction process is a dump of web page versions: either snapshots that have been crawled
or, specifically for Wikipedia, the complete edit history as a set of XML files¹. These
XML files contain the actual version content encoded as Wikitext, a markup language
including table markup, as well as additional metadata for each of the revisions, such as
its timestamp, author, and comments.
   For every page version, we extract a (possibly empty) list of parsed table nodes. This
list of table nodes for every page revision constitutes the input for our table matching.
For each page revision, it is necessary to decide whether the tables therein are versions
of previously identified tables or entirely new tables. It is not sufficient to consider
only tables of the directly preceding version, because tables can be deleted and be
restored several revisions (and sometimes several years) later. To determine the quality
of our matching, we have manually created a gold standard of table matchings comprising
1,445 tables, with a total of 16,919 distinct table versions selected from 90 pages. We
show that the matching works well, matching all versions that belong to a given table
correctly for 88.58% of all tables in the gold standard. For a majority of the remaining
tables, we misclassify only a small number of versions of individual table histories
(1 mistake: 6.09%; 2 mistakes: 2.98%). Our matching decisions for individual table
versions reach >99% F1-measure.
   We next present various statistics based on the 3,471,609 Wikipedia table objects we
have collected that have been linked using our matching process. These statistics are
based on the Wikidump of September 1, 2019. Both our gold standard and output dataset are
available at our project website www.IANVS.org. The different statistics and findings are
grouped by the three phases of a table’s existence: creation, evolution and deletion.
Already these three phases of existence show that establishing a table identity is
essential for the following statistics, because without it, it would not be possible to
determine statistics that aggregate on a per-table basis (but only on a per-version basis).

¹ https://dumps.wikimedia.org/

4    CREATION
In this section, we focus on the first insert of every table – its creation – even when
during its lifetime a table might be deleted and recreated multiple times.

4.1    Where and when are tables created?
Figure 2 shows that Wikipedia pages in their initial years (2001–2003) had almost no
tables. Using tables in Wikipedia became more

Figure 2: Number of tables and pages created per month.
(x-axis: creation date; y-axis: number of new tables and pages; series: Pages, Tables)

Figure 3: Histogram of the maximum table count per page.
(x-axis: maximum table count; y-axis: number of pages, logarithmic scale)
Bar labels by bucket: 1: 663k; 2–3: 327k; 4–7: 143k; 8–15: 51k; 16–31: 15k; 32–63: 4k;
64–127: 734; 128–255: 113; 256–511: 21; 512–1023: 4; 1024–2047: 1.

Figure 4: Correlation of years mentioned in categories and time of creation.
(x-axis: table creation date; y-axis: mentioned year; panels: same year, different year)


popular only around 2004 and tables were fully adopted by the end of 2006. Since then,
every month around 20,000 new tables are created (about one every two minutes). The
hypothesis that insertion frequency would decrease once tables are inserted at all
relevant locations seems false: While the number of new pages created per month has been
dropping since 2007², the insertion-rate of new tables remains constant. This relative
increase in tables per page shows that more and more data is stored in a structured
fashion, raising the relevance of methods to extract knowledge from said tables.
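
As a quick sanity check of the rate quoted above, the following minimal sketch converts a monthly creation count into the average gap between two table creations. The 30-day month is an assumption made here for illustration, and the function name is hypothetical.

```python
def minutes_between_creations(tables_per_month: int,
                              days_per_month: float = 30.0) -> float:
    """Average number of minutes between two table creations.

    days_per_month is an illustrative assumption, not a value from the paper.
    """
    minutes_per_month = days_per_month * 24 * 60
    return minutes_per_month / tables_per_month


gap = minutes_between_creations(20_000)
print(f"{gap:.2f} minutes between new tables")
```

With 43,200 minutes in a 30-day month, 20,000 tables per month correspond to roughly 2.16 minutes between creations, which agrees with "about one every two minutes".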

   We observed separately that most tables are created at the same time or soon after the
page containing them is created. Larger gaps between the page creation and the creation of
the first table on the page are common only for pages that were created at the beginning
of Wikipedia, when tables were not yet popular.
   Figure 3 shows a histogram of the maximum number of tables                                               Figure 5: Histogram of tables bucketed by the total number
that ever existed simultaneously on a Wikipedia article. The vast                                           of tables an author created.
majority of Wikipedia articles contain only a few tables (we omitted
the even larger number of pages that do not contain any tables at
all). On the other hand, most tables appear on pages together with other tables. Only 19.1% of all tables appear alone on a Wikipedia article. The many tables that exist in the vicinity of each other can be assumed to be related in terms of content.
   On Wikipedia, every article can link to categories, which are used to group related articles by topic and can themselves be organized in categories. We investigate how the creation dates of tables correlate with any year mentioned in these page categories (such as 2020 for “2020 United States presidential election”), which we assume to be the relevant years for that table. Figure 4 shows that the extracted years and the creation year match for most tables. For every mention of a year in the page categories, a table is counted in a cell that represents the month of creation (on the x-axis) and the mentioned year (on the y-axis). If those two dimensions aligned perfectly, we would see marks only close to the diagonal of the plot. Tables tend to be created in the second half of the mentioned year or at the beginning of the following year, which shows as a small shift to the right in the plot. For those years that are covered by our dataset (2004–2019), in 50.8% of all cases the mentioned year and the year of creation align, in 5.6% of the cases the tables are created before the mentioned year, and in 43.5% of the cases the tables are created after the mentioned year.

2 Source: https://stats.wikimedia.org/#/en.wikipedia.org/contributing/new-pages/normal|line|2001-01-01~2020-10-01|page_type~content|monthly

4.2     Who creates tables?
The distribution of the number of tables that a user creates is shown in Figure 5. Only 13.1% of the tables are created by non-registered users. The figure also clearly shows that tables are more likely to be created by power users: more than half of the tables are created by users who each have also created at least 128 other tables. The record for the highest number of tables created by a single user is 20,194 (in this case on a variety of sports events).
   A possible explanation for this behavior could be that the effort and skill it takes to create a new table is too high for many users. On the other hand, there are very dedicated users who must have acquired the necessary skills and possibly tools to create thousands of tables. One insight that we can take from this observation is that any random sample of tables is likely to be influenced by those power users.
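To make the power-user statistic concrete, the following minimal sketch computes the share of tables whose creator has created at least a given number of tables. The function name, the input format (one creator identifier per table), and the toy data are assumptions made for illustration; they are not taken from the paper's pipeline.

```python
from collections import Counter


def share_by_prolific_creators(creators, min_tables):
    """Fraction of tables whose creator created at least `min_tables` tables.

    `creators` is a list with one creator identifier per table
    (a hypothetical input format).
    """
    if not creators:
        return 0.0
    per_user = Counter(creators)  # number of tables created by each user
    covered = sum(n for n in per_user.values() if n >= min_tables)
    return covered / len(creators)


# Toy data, not taken from the paper's corpus.
creators = ["alice"] * 6 + ["bob"] * 3 + ["carol"]
print(share_by_prolific_creators(creators, 3))  # 0.9
```

With the threshold set to 129 (a user who "also created at least 128 other tables"), the same function would yield the kind of share reported above when run over the full list of table creators.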
Figure 6: Numbers of users that use the same template for different numbers of tables. [Plot: user count (x-axis) against table count (y-axis), both on logarithmic scales.]
Figure 7: Two concrete examples of table templates. (a) A general table template. (b) A sport-specific table template.
Figure 8: Missed updates in relation to snapshot date. [Plot: number of tables (y-axis) per snapshot date (x-axis), split into: to be deleted, to be updated, unchanged, to be created.]
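The per-template counts plotted in Figure 6 (and discussed in Section 4.3) can be approximated by grouping tables on the markup of their first version. The sketch below is only an illustration under stated assumptions: the tuple format `(table_id, user, markup)`, the function name, and the choice to compare markup exactly (up to surrounding whitespace) are hypothetical and not confirmed by the paper.

```python
from collections import defaultdict


def template_usage(first_versions):
    """Group tables by identical first-version markup.

    `first_versions` is an iterable of (table_id, user, markup) tuples
    (a hypothetical input format).
    Returns {markup: (table_count, user_count)}.
    """
    tables = defaultdict(set)
    users = defaultdict(set)
    for table_id, user, markup in first_versions:
        key = markup.strip()  # assumption: exact match up to outer whitespace
        tables[key].add(table_id)
        users[key].add(user)
    return {key: (len(tables[key]), len(users[key])) for key in tables}


# Toy data, not taken from the paper's corpus.
GENERAL = '{| class="wikitable"\n! A !! B\n|}'
SPORT = '{|\n! Year !! Position\n|}'
rows = [
    (1, "alice", GENERAL),
    (2, "alice", GENERAL),
    (3, "bob", GENERAL),
    (4, "carol", SPORT),
]
usage = template_usage(rows)
unique_first_versions = sum(1 for n_tables, _ in usage.values() if n_tables == 1)
print(usage[GENERAL], unique_first_versions)  # (3, 2) 1
```

Each resulting pair corresponds to one point in a plot like Figure 6; markup that occurs only once corresponds to a table with a unique first version.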

Figure 9: Table freshness over time. [Violin plot: table age in years (x-axis) against time since last update in years (y-axis).]

4.3     How are tables created?
Creating tables is a tedious job, especially for inexperienced users, who might not be familiar with the syntax of tables. An obvious hypothesis is therefore that users copy & paste similar tables (created by other users or themselves) and adapt them according to their needs. To investigate this hypothesis, we studied the frequency with which the same content appears in the first version of different tables. For a more accurate picture, we also analyzed how many users chose to use exactly the same table markup code, presumably as table templates.
   The first observation we made is that 3,004,883 of the 3,471,609 tables (86.6%) in our corpus appear to have unique first versions. This does not imply that they were not copied from somewhere else; they may have been copied but modified prior to the first save of the page. On the other end of the spectrum, there are templates that are used more than 15,000 times to create tables.
   As can be seen in Figure 6, the ratio between the number of tables and the number of users that use the same template varies greatly. Some templates are used for thousands of tables, but also by thousands of different users (top right in the plot). This is usually the case for example tables that contain only dummy values (an example can be seen in Figure 7a). However, there are other templates that are also used for thousands of tables, but only by a few dozen users (top left in the plot). These are usually domain-specific templates, such as tables for sports results or statistics (see Figure 7b for an example).

5     EVOLUTION
The second phase in the lifetime of a table is its evolution. This phase encompasses all changes that happen between the initial creation and the possible final deletion, including changes to data, to schema, and to shape.

5.1     How often are tables changed?
The average table in our corpus is re-inserted 0.62 times, deleted 0.93 times, and updated 13.89 times. Of the 0.62 re-inserts, 0.10 are fresh table versions, i.e., the table’s content is different from any previous version, which means that 0.52 of the inserts restore previously existing table versions that were deleted at some prior point in time. For the 13.89 updates, the ratio of fresh and old versions is 11.97 fresh updates versus 1.92 updates that restore previously existing versions. While these average numbers seem quite low, there is a large skew: there is a table on social networking websites that was updated more than 10,000 times during its lifetime. At least 1,310 tables were each updated more than 1,000 times during their lifetimes.
   Figure 8 shows the number of tables that would have been created/updated/deleted by the date of our analyzed snapshot (September 1, 2019), if the snapshot were taken at a previous point in time (shown on the x-axis). In a one-month-old snapshot, 4.4% of tables are already outdated. If the snapshot were taken a year earlier, 26.6% of the tables would no longer represent the current state. Over a five-year time range, this number rises to 60.6%.
   The violin plot in Figure 9 shows how the change frequency behaves with increasing table age. The shape of a violin plot follows the distribution of the values: the wider the line, the more probable the value. Within the violin plot, there are marks at the 0.25/0.5/0.75 quantiles. In particular, it shows how the time since the last update is distributed for tables at different ages. The median rises until a certain point, after which it stays constant or slightly decreases again. However, the distribution is skewed towards the two ends of the spectrum: tables either are very frequently updated or are hardly ever changed. For example, considering the quantiles for the 5-year-old tables, more than 25% of these tables were updated in
Figure 10: Creator update activity for tables created by registered users. [Plot: table age in years (x-axis) against share of edits (y-axis), split into edits by the creator and by non-creators.]
Figure 11: Number of editors per table. [Plot: number of editors (x-axis, in buckets 1, 2–3, 4–7, and so on up to 2048–4095) against number of tables (y-axis, logarithmic).]
Figure 12: A density plot of table position differences between two consecutive table versions. [Plot: previous position (x-axis) against new position (y-axis).]


the last year, and for another 25% the last update dates back almost four years or more.

5.2     Who changes tables?
Figure 10 shows how long the original creator is active as an updater of a table. We distinguish between registered and unregistered creators, because for unregistered creators we have only the IP address as an identifier, which might change from time to time. Therefore, it is not too surprising that the share of edits that is done by an unregistered creator quickly drops and, hence, we exclude tables created by unregistered users from this plot. On the other hand, for registered users, there are tables that are still updated by the original creators years after they have been created. In reality, the influence of the original author on a table could be even higher than what this plot suggests: the number of edits for a table decreases over time, so the first buckets contain more edits.
   When we look at the number of editors that change individual tables in Figure 11, we see that a large number of tables (35%) are updated by the creator of the table only. In this analysis, we only consider users as editors of a table who create a new version (a simple revert to a previous version is not counted). While most of the tables are updated by only a few users, there are some exceptions where thousands of users contribute to the table. Again, the previously mentioned table on social networking websites holds the record with contributions by 4,235 distinct users.

5.3     Are tables moved?
Figure 12 shows how much tables move in relation to other tables on their page. While for most page revisions the tables do not move or move only slightly, there are page revisions for which tables move by up to 1,574 positions on a single page (we removed this one extreme case as an outlier). We observe that if tables move, this is often due to the insertion or deletion of tables, and that tables move down on the page (64.09%) more often than up (35.91%). One obvious reason for this imbalance is the fact that a table inserted in the middle of the page causes other tables to move down, and insertions are more common than deletions. On average, a table’s position changes 1.66 times during its lifespan.

5.4     How much do tables change?
Figure 13 shows how the content of tables develops over time. More precisely, it shows a similarity score of each table compared to the first version of that table (calculated on a random subset of 1,000 tables). We use a similarity metric that is based on a word vector representation of both table versions: $\mathrm{sim}(\vec{v}, \vec{w}) = \frac{\sum_i \min(v_i, w_i)}{\sum_i \max(v_i, w_i)}$, where $\vec{v}$ and $\vec{w}$ are word vectors of the two table versions that should be compared. In general, the similarity is expected to decrease over time, but it can also rise if the table content becomes more similar to its original version. While there are some tables that stay almost unchanged throughout their lifetime, there are other tables that rapidly change within the first few days of their existence. One reason for this could be that people copy & paste other tables as templates and then adjust the content, as explained in Section 4.3.
   During their lifetime, 23.6% of all tables either grow or shrink in the number of columns, and 37.0% grow or shrink in the number of rows. However, 57.3% of all tables retain their original size throughout their lifetime.

5.5     How much do schemata change?
About half of all tables never change their schema, as can be seen in Figure 14 (note that this is a log-log plot). On the other hand, there are tables that change their schema hundreds of times, up to 443 changes. On average, each table has 1.86 schema versions. The types of schema change are manifold: for example, columns are renamed, added, or removed.
   Figure 16 shows a vivid example of how schemata of web tables evolve over time. To create this plot, we created a clustering of schemata based on tables that evolve from one schema to another. This particular plot shows a cluster of schemata that all contain information about league results of football teams. There are almost 500 tables for which at least one of the snapshots had one of the Schemata 2–7. More than half of those tables followed Schema 6 at the beginning of 2011, while the other half mostly did not yet exist (Group 1). The splines show how this schema evolved into many different specializations until 2018. While in some cases these specializations make sense (such as a clarification about the league system), in other cases they are due to inconsistent changes (such as
Figure 13: Similarity of table versions and the table’s first version. Each line represents one table. [Plot: table age in years (x-axis) against similarity sim (y-axis).]
Figure 14: Number of schema versions per table. [Plot: schema versions (x-axis) against count (y-axis), both on logarithmic scales.]
Figure 15: Time until deletion. [Plot: table age in years (x-axis) against share of tables (y-axis), split into deleted and alive.]
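The similarity score plotted in Figure 13 is defined in Section 5.4 as the sum of element-wise minima of two word vectors divided by the sum of their element-wise maxima. The following minimal sketch illustrates that formula. It assumes that the word vectors are simple word-frequency counts over whitespace-separated, lower-cased tokens; the paper does not specify tokenization or weighting, so these choices, like the helper names and the toy data, are assumptions.

```python
from collections import Counter


def word_vector(cells):
    """Bag-of-words counts over all cell strings of one table version."""
    counts = Counter()
    for cell in cells:
        counts.update(cell.lower().split())  # assumption: whitespace tokens
    return counts


def sim(v, w):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    keys = set(v) | set(w)
    denominator = sum(max(v[k], w[k]) for k in keys)
    if denominator == 0:
        return 1.0  # convention chosen here: two empty versions are identical
    return sum(min(v[k], w[k]) for k in keys) / denominator


# Toy table versions, not taken from the paper's corpus.
first = word_vector(["Year", "Division", "2010", "Bayernliga"])
later = word_vector(["Season", "Division", "2010", "Bayernliga", "2011", "Landesliga"])
print(sim(first, later))  # 3 shared words out of 7 distinct words: 3/7
```

Identical versions score 1, versions without any common word score 0, and a table that is edited back towards its first version sees its score rise again, which matches the behavior described in Section 5.4.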

Figure 16: Example of schemata evolving over time. [Plot: time from 2008-01 to 2018-01 (x-axis) against count of tables (y-axis, 0 to 500), with one group per schema. Legend:]
   1 NON-EXISTENT
   2 Season, Division, [[Bavarian football league system|Tier]], Position
   3 Season, Division, Tier, Position
   4 Year, Division, [[Bavarian football league system|Tier]], Position
   5 Year, Division, [[German football league system|Tier]], Position
   6 Year, Division, Position (bold header)
   7 Year, Division, Position
   8 OTHER-SCHEMA


the header “Year”, which after manual inspection should in most cases actually be “Season”, a range spanning two consecutive years). As these tables are web tables, the header can also be formatted differently, and we can see that for most tables of Schema 7, the header was changed to bold type (Schema 6) between 2009 and 2010. Still, a small number of tables had not made this transition even after almost a decade.

6     DELETION
Figure 15 shows how long tables survive, counting the days from their creation. The blue part shows the percentage of tables that reached the respective age without being deleted. The green part represents those tables that were created long enough ago that they could have reached the respective age, but were deleted before reaching that age. 69.5% of all tables ever created have survived until the end date of our dataset. If a table is deleted, then this usually happens at the beginning of its lifetime. The longer a table exists, the less likely it becomes that it will be deleted.
   From the time a table is created until it is deleted (or until the end date of the dataset), an average of 4.93 years passes in our table corpus. In 97.7% of that time, the table is truly part of the page, while in the remaining 40.50 days the table is (temporarily) deleted.
   While the vast majority of tables are never deleted (57.2%) or deleted only once (29.9%), there is a large skew in the distribution of deletes. One table that explains the Wiki syntax was deleted 620 times during its lifetime, mostly due to vandalism.

7     CONCLUSIONS
In summary, we have seen how fast tables on Wikipedia change and how fast they come and go. When working with this corpus, it is important to keep this additional temporal dimension in mind and leverage it when possible. The history also makes other dimensions of the corpus accessible, such as the creators, editors, or templates, which together provide a perspective on the tables that is more holistic than single snapshots of individual tables or a table corpus.
   As future work, we plan to explore whether other structured corpora, such as Wikipedia infoboxes or lists, for which we also provide histories, behave similarly in terms of their dynamics. Furthermore, we want to use the gained insights to assign trust to values and improve data quality. We encourage researchers to explore their datasets in a similar manner to uncover hidden information in a dataset’s history.
REFERENCES                                                                                        (SSDBM). 8:1–8:12.
 [1] B. Thomas Adler, Krishnendu Chatterjee, Luca De Alfaro, Marco Faella, Ian               [13] Mihai Georgescu, Nattiya Kanhabua, Daniel Krause, Wolfgang Nejdl, and Stefan
     Pye, and Vishwanath Raman. 2008. Assigning trust to Wikipedia content. In                    Siersdorfer. 2013. Extracting event-related information from article updates in
     Proceedings of the International Symposium on Wikis (WikiSym). 26:1–26:12.                   Wikipedia. In Advances in Information Retrieval (ECIR). Springer, 254–266.
 [2] Enrique Alfonseca, Guillermo Garrido, Jean-Yves Delort, and Anselmo Peñas.              [14] Braden Hancock, Hongrae Lee, and Cong Yu. 2019. Generating Titles for Web
     2013. WHAD: Wikipedia historical attributes data - Historical structured data                Tables. In Proceedings of the International World Wide Web Conference (WWW).
     extraction and vandalism detection from the Wikipedia edit history. Language                 ACM, 638–647.
     Resources and Evaluation 47, 4 (2013), 1163–1190.                                       [15] Aniket Kittur, Bongwon Suh, Bryan A. Pendleton, and Ed H. Chi. 2007. He says,
 [3] Tobias Bleifuß, Leon Bornemann, Theodore Johnson, Dmitri V. Kalashnikov, Felix               she says: conflict and coordination in Wikipedia. In Proceedings of the International
     Naumann, and Divesh Srivastava. 2018. Exploring Change – A New Dimension                     Conference on Human Factors in Computing Systems (SIGCHI). 453–462.
     of Data Analytics. PVLDB 12, 2 (2018), 85–98.                                           [16] Flip Korn, Xuezhi Wang, You Wu, and Cong Yu. 2019. Automatically Generat-
 [4] Tobias Bleifuß, Leon Bornemann, Dmitri V Kalashnikov, Felix Naumann, and                     ing Interesting Facts from Wikipedia Tables. In Proceedings of the International
     Divesh Srivastava. 2021. Structured Object Matching across Web Page Revisions.               Conference on Management of Data (SIGMOD). 349–361.
     In Proceedings of the International Conference on Data Engineering (ICDE).              [17] Larissa R. Lautert, Marcelo M. Scheidt, and Carina F. Dorneles. 2013. Web Table
 [5] Leon Bornemann, Tobias Bleifuß, Dmitri V Kalashnikov, Felix Naumann, and Di-                 Taxonomy and Formalization. SIGMOD Record 42, 3 (2013), 28–33.
     vesh Srivastava. 2020. Natural Key Discovery in Wikipedia Tables. In Proceedings        [18] Oliver Lehmberg, Dominique Ritze, Robert Meusel, and Christian Bizer. 2016.
     of The Web Conference 2020. 2789–2795.                                                       A large public corpus of web tables containing time and context metadata. In
 [6] Andrei Z. Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar                    Proceedings of the International Conference Companion on World Wide Web. 75–76.
     Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet L. Wiener. 2000. Graph             [19] Pei Li, Xin Luna Dong, Andrea Maurino, and Divesh Srivastava. 2011. Linking
     structure in the Web. Comput. Networks 33, 1-6 (2000), 309–320.                              temporal records. PVLDB 4, 11 (2011), 956–967.
 [7] A.D. Broido and A. Clauset. 2019. Scale-free networks are rare. Nature Commu-           [20] Mostafa Mesgari, Chitu Okoli, Mohamad Mehdi, Finn Årup Nielsen, and Arto
     nications 10, 1017 (2019).                                                                   Lanamäki. 2015. The sum of all human knowledge: A systematic review of
 [8] Luciana S. Buriol, Carlos Castillo, Debora Donato, Stefano Leonardi, and Stefano             scholarly research on the content of Wikipedia. Journal of the Association for
     Millozzi. 2006. Temporal analysis of the wikigraph. In International Conference              Information Science and Technology 66, 2 (2015), 219–245.
     on Web Intelligence (WI). 45–51.                                                        [21] Rakesh Pimplikar and Sunita Sarawagi. 2012. Answering Table Queries on the
 [9] Siarhei Bykau, Flip Korn, Divesh Srivastava, and Yannis Velegrakis. 2015. Fine-              Web Using Column Keywords. PVLDB 5, 10 (2012), 908–919.
     grained controversy detection in Wikipedia. In Proceedings of the International         [22] Shazia Wasim Sadiq, Tamraparni Dasu, Xin Luna Dong, Juliana Freire, Ihab F.
     Conference on Data Engineering (ICDE). 1573–1584.                                            Ilyas, Sebastian Link, Renée J. Miller, Felix Naumann, Xiaofang Zhou, and Divesh
[10] Michael Cafarella, Alon Halevy, Hongrae Lee, Jayant Madhavan, Cong Yu,                       Srivastava. 2017. Data Quality: The Role of Empiricism. SIGMOD Record 46, 4
     Daisy Zhe Wang, and Eugene Wu. 2018. Ten years of webtables. PVLDB 11, 12                    (2017), 35–43.
     (2018), 2140–2149.                                                                      [23] Paolo Sottovia, Matteo Paganelli, Francesco Guerra, and Yannis Velegrakis. 2019.
[11] Andrea Ceroni, Mihai Georgescu, Ujwal Gadiraju, Kaweh Djafari Naini, and                     Finding Synonymous Attributes in Evolving Wikipedia Infoboxes. In Advances
     Marco Fisichella. 2014. Information evolution in Wikipedia. In Proceedings of the            in Databases and Information Systems (ADBIS). 169–185.
     International Symposium on Open Collaboration (OpenSym). 24:1–24:10.                    [24] Jingjing Wang, Haixun Wang, Zhongyuan Wang, and Kenny Q. Zhu. 2012. Un-
[12] Julian Eberius, Maik Thiele, Katrin Braunschweig, and Wolfgang Lehner. 2015.                 derstanding tables on the web. In Proceedings of the International Conference on
     Top-k Entity Augmentation Using Consistent Set Covering. In Proceedings of                   Conceptual Modeling (ER). Springer, 141–155.
     the International Conference on Scientific and Statistical Database Management          [25] Ziqi Zhang. 2017. Effective and Efficient Semantic Table Interpretation using
                                                                                                  TableMiner+. Semantic Web 8, 6 (2017), 921–957.



