=Paper=
{{Paper
|id=Vol-2929/paper4
|storemode=property
|title=The Secret Life of Wikipedia Tables
|pdfUrl=https://ceur-ws.org/Vol-2929/paper4.pdf
|volume=Vol-2929
|authors=Tobias Bleifuß,Leon Bornemann,Dmitri V. Kalashnikov,Felix Naumann,Divesh Srivastava
|dblpUrl=https://dblp.org/rec/conf/vldb/BleifussBKNS21
}}
==The Secret Life of Wikipedia Tables==
Tobias Bleifuß (Hasso Plattner Institute, University of Potsdam, Germany; tobias.bleifuss@hpi.de), Leon Bornemann (Hasso Plattner Institute, University of Potsdam, Germany; leon.bornemann@hpi.de), Dmitri V. Kalashnikov∗ (dmitri.vk@acm.org), Felix Naumann (Hasso Plattner Institute, University of Potsdam, Germany; felix.naumann@hpi.de), Divesh Srivastava (AT&T Chief Data Office, United States; divesh@att.com)
ABSTRACT

Tables on the web, such as those on Wikipedia, are not the static grid of values that they seem to be. Rather, they have a life of their own: they are created under certain circumstances and in certain webpage locations, they change their shape, they move, they grow, they shrink, their data changes, they vanish, and they re-appear. When users look at web tables or when scientists extract data from them, they are most likely not aware that behind each table lies a rich history.

For this empirical paper, we have extracted, matched, and analyzed the entire history of all 3.5 M tables on the English Wikipedia for a total of 53.8 M table versions. Based on this enormous dataset of public table histories, we provide various analysis results, such as statistics about lineage sizes, table positions, volatility, change intervals, schema changes, and their editors. Apart from satisfying curiosity, analyzing and understanding the change behavior of web tables serves various use cases, such as identifying out-of-date values, recognizing systematic changes across tables, and discovering change dependencies.

Figure 1: An example of an evolving table in Wikipedia (from https://en.wikipedia.org/?diff=prev&oldid=541341520).

Reference Format:
Tobias Bleifuß, Leon Bornemann, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. The Secret Life of Wikipedia Tables. In the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores (SEA Data 2021).

∗ Work done while at AT&T Research.

Copyright © 2021 for the individual papers by the papers' authors. Copyright © 2021 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0). Published in the Proceedings of the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores, co-located with VLDB 2021 (August 16-20, 2021, Copenhagen, Denmark) on CEUR-WS.org.

1 EMPIRICAL RESEARCH ON THE WEB

Traditionally, empiricism plays a minor role in the theory- and engineering-oriented field of our research community, while it has played a significant role in other disciplines of computer science (e.g., [6, 7]). Hardly ever do we pause to analyze and reflect on the observable, "natural" behavior of data and systems. Among the notable exceptions is the area of data quality research [22].

One example of observable behavior is that of web tables, in particular those that are collaboratively edited. As such, the set of tables on Wikipedia is an example of a very heterogeneous and semi-structured data lake. Such web tables are used for a variety of purposes, as recently surveyed in [10], including entity extraction and fact generation [16], improving web search [21], and entity linking [19]. Other work seeks to enhance web tables themselves, such as generating their titles [14], generating column headers [24], or finding subject columns [5, 25]. Again, all of these approaches make use of table content, headers, and surrounding text and data. Providing more such data, and in particular different versions of such data, gives these machine learning approaches a richer input set.

In the context of our Janus project [3], we have been extracting and working with the histories of various structured datasets, including DBLP, IMDB, open government data, and in particular Wikipedia, for which a detailed history of every edit is available. In this empirical paper, we focus on tables as they appear on Wikipedia pages and report on our various observations across their lifetime, including their creation (Section 4), their evolution over time (Section 5), and ultimately their deletion (Section 6). We report on such varied dimensions as table counts, users, duration, edits, table similarity, table position, and of course time itself, highlighting expected and some surprising behavior. Figure 1 shows one exemplary evolutionary step (in schema and data) for one of millions of Wikipedia tables.

Our results can help researchers better understand the volatility of web tables: a given table or corpus snapshot is not a stable basis but rather just that: a snapshot with a history of changes leading up to it and a future with many further changes. In fact, at the time of writing, any given Wikipedia table was changed twice in the past year on average, but with a standard deviation of 9.1, and some tables changed multiple times per day. But not only the content of tables changes; their schema also evolves over time. This information about evolving schemata can serve, for example, to identify synonymous attributes [23].

In the following, we highlight selected analyses in this paper and outline their possible implications for researchers:

Need for timeliness. Figure 8 shows how quickly a snapshot becomes outdated. As a consequence, all models that are trained on static snapshots run the risk of quickly becoming obsolete. Efficient methods for updating them are therefore desirable.

Help with maintenance and updating. A large portion of tables is created and maintained by power-users, as can be seen in Figures 5 and 10. Knowledge about update patterns could be used to notify these editors about (potentially) outdated values and the need for updates.

Suggestions for cleaning and improvement. Figure 16 illustrates an example of the inconsistencies that can emerge in tables. Based on the knowledge of how similar tables evolve, one can make concrete suggestions to improve their (in this case) schemata, such as to rename or add certain columns.

2 RELATED WORK

We discuss two types of related work: structured datasets with history, and Wikipedia change analysis. There is a lot of research on web tables, so we can only provide a high-level overview in this short paper and refer to surveys and research papers for more details.

Related corpora. Wikipedia provides access to its entire version history, allowing us to track very fine-grained changes. A variety of datasets that also deal with (semi-)structured content have been extracted from the web and Wikipedia before. Multiple corpora of web tables [12, 18] provide extracts of static versions of tables on the web and have since been subject to extensive research [10]. For example, Lautert et al. establish a taxonomy of web tables and thus give an insight into the general structure of static web tables in [17], whereas we focus on the temporal evolution of web tables. The infobox history dataset WHAD [2] comprises structured information on Wikipedia, namely the changes of infoboxes. This dataset is orthogonal to ours, since it does not cover general tables, which are more diverse and also more complex in comparison.

Analyzing changes in Wikipedia. The content of Wikipedia has been the subject of much research [20]. While large parts of that research were conducted on static snapshots of Wikipedia, a variety of works analyzes changes on Wikipedia. Both the evolution of content [11] and the evolution of the page link graph [8] have been studied. Specifically, the study of content evolution can help detect conflicts [15] or controversy that may result in edit wars [9]. The edit histories serve as input to event extraction [13] and are also valuable for trust assignment [1]. Our approach can provide a better understanding of what has really changed, from which many of these studies should benefit.

3 TABLE CORPUS

To explore tables over time, we need to be able to track tables over time, which is a non-trivial task as tables and their context can change over time. We consider a table as an object with an identity that, in contrast to its concrete shape and content, stays constant over time. A table can have multiple versions, where each version is an edit of the previous version. However, tables on the web usually lack a stable identifier, which is why we have to infer that identifier. We proposed a solution for this identity inference through a table matching procedure. The details of this work are described in [4], where we also evaluate the matching and show that it works much better than related work. Our input for the table change extraction process is a dump of web page versions – either snapshots that have been crawled or, specifically for Wikipedia, the complete edit history as a set of XML files¹. These XML files contain the actual version content encoded as Wikitext, a markup language including table markup, as well as additional metadata for each of the revisions, such as its timestamp, author, and comments.

¹ https://dumps.wikimedia.org/

For every page version, we extract a (possibly empty) list of parsed table nodes. This list of table nodes for every page revision constitutes the input for our table matching. For each page revision, it is necessary to decide whether the tables therein are versions of previously identified tables or entirely new tables. It is not sufficient to consider only the tables of the directly preceding version, because tables can be deleted and restored several revisions (and sometimes several years) later. To determine the quality of our matching, we have manually created a gold standard of table matchings comprising 1,445 tables, with a total of 16,919 distinct table versions selected from 90 pages. We show that the matching works well, matching all versions that belong to a given table correctly for 88.58% of all tables in the gold standard. For a majority of the remaining tables, we misclassify only a small number of versions of individual table histories (1 mistake: 6.09%; 2 mistakes: 2.98%). Our matching decisions for individual table versions reach >99% F1-measure.

We next present various statistics based on the 3,471,609 Wikipedia table objects we have collected and linked using our matching process. These statistics are based on the Wikidump of September 1, 2019. Both our gold standard and output dataset are available at our project website www.IANVS.org. The different statistics and findings are grouped by the three phases of a table's existence: creation, evolution, and deletion. Already these three phases show that establishing a table identity is essential for the following statistics, because without it, it would not be possible to determine statistics that aggregate on a per-table basis (but only on a per-version basis).

4 CREATION

In this section, we focus on the first insert of every table – its creation – even when during its lifetime a table might be deleted and recreated multiple times.

4.1 Where and when are tables created?

Figure 2 shows that Wikipedia pages in their initial years (2001–2003) had almost no tables.
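To make the extraction step of Section 3 concrete: Wikitext tables open with `{|` and close with `|}` at the start of a line. The following is a minimal, simplified sketch of pulling table source blocks out of one revision's Wikitext; it ignores tables produced by template expansion and is not the authors' actual parser.

```python
# Minimal sketch: extract top-level Wikitext table blocks from one page
# revision. Wikitext tables are delimited by "{|" and "|}" at line start;
# nested tables are handled by depth counting. Not a full Wikitext parser.
def extract_tables(wikitext: str) -> list[str]:
    tables, current, depth = [], [], 0
    for line in wikitext.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("{|"):
            depth += 1
        if depth > 0:
            current.append(line)
        if stripped.startswith("|}") and depth > 0:
            depth -= 1
            if depth == 0:
                tables.append("\n".join(current))
                current = []
    return tables

page = """Intro text.
{|
|-
! Year !! Division !! Position
|-
| 2010 || Bezirksliga || 1
|}
More text."""
print(len(extract_tables(page)))  # 1 table found
```

In the real pipeline, each such block would be parsed into a table node and fed, together with its revision metadata, into the table matching described in [4].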
Figure 2: Number of tables and pages created per month.
Figure 3: Histogram of the maximum table count per page.
Figure 4: Correlation of years mentioned in categories and time of creation.

Using tables in Wikipedia became popular only around 2004, and tables were fully adopted by the end of 2006. Since then, around 20,000 new tables are created every month (about one every two minutes). The hypothesis that insertion frequency would decrease once tables are inserted at all relevant locations seems false: while the number of new pages created per month has been dropping since 2007², the insertion rate of new tables remains constant. This relative increase in tables per page shows that more and more data is stored in a structured fashion, raising the relevance of methods to extract knowledge from said tables.

² Source: https://stats.wikimedia.org/#/en.wikipedia.org/contributing/new-pages/normal|line|2001-01-01~2020-10-01|page_type~content|monthly

We observed separately that most tables are created at the same time as, or soon after, the page containing them. Only for pages that were created at the beginning of Wikipedia, when tables were not so popular, are larger gaps between the page creation and the creation of the first table on the page common.

Figure 3 shows a histogram of the maximum number of tables that ever existed simultaneously on a Wikipedia article. The vast majority of Wikipedia articles contain only a few tables (we omitted the even larger number of pages that do not contain any tables at all). On the other hand, most tables appear on pages together with other tables. Only 19.1% of all tables appear alone on a Wikipedia article. The many tables that exist in the vicinity of each other can be assumed to be related in terms of content.

On Wikipedia, every article can link to categories, which are used to group articles related to a topic and can themselves be organized in categories. We investigate how the creation dates of tables correlate with any year mentioned in these page categories (such as 2020 for "2020 United States presidential election"), which we assume to be the relevant years for that table. Figure 4 shows that the extracted years and the creation year match for most tables. For every mention of a year in the page categories, a table is counted in a cell that represents the month of creation (on the x-axis) and the mentioned year (y-axis). If those two dimensions aligned perfectly, we would only see marks close to the diagonal of the plot. There is a tendency for tables to be created in the second half of the mentioned year or at the beginning of the following year, which shows as a small shift to the right in the plot. For those years that are covered by our dataset (2004–2019), in 50.8% of all cases the mentioned year and the year of creation align; in 5.6% of the cases the tables are created before the mentioned year, and in 43.5% of the cases the tables are created after the mentioned year.

4.2 Who creates tables?

The distribution of the number of tables that a user creates is shown in Figure 5. Only 13.1% of the tables are created by non-registered users. The figure also clearly shows that tables are more likely to be created by power-users: more than half of the tables are created by users who each have also created at least 128 other tables. The record for the highest number of tables created by a single user is 20,194 (in this case on a variety of sports events).

Figure 5: Histogram of tables bucketed by the total number of tables an author created.

A possible explanation for this behavior could be that the effort and skill it takes to create a new table is too high for many users. On the other hand, there are very dedicated users who must have acquired the necessary skills and possibly tools to create thousands of tables. One insight that we can take from this observation is that any random sample of tables is likely to be influenced by those power users.
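Histograms such as Figure 5 bucket tables by how many tables their creator made in total, using exponentially growing buckets (1, 2–3, 4–7, 8–15, ...). A minimal sketch of this bucketing; the (table_id, creator) input records are hypothetical, not the paper's actual data format.

```python
from collections import Counter

# Bucket tables by the total number of tables their creator made,
# using power-of-two buckets (1, 2-3, 4-7, 8-15, ...) as in Figure 5.
# The (table_id, creator) records are a hypothetical input format.
def bucket_by_creator_activity(creations):
    per_creator = Counter(creator for _table, creator in creations)
    buckets = Counter()
    for _table, creator in creations:
        n = per_creator[creator]
        low = 1 << (n.bit_length() - 1)  # largest power of two <= n
        buckets[(low, 2 * low - 1)] += 1
    return dict(buckets)

creations = [("t1", "alice"), ("t2", "alice"), ("t3", "alice"), ("t4", "bob")]
print(bucket_by_creator_activity(creations))  # {(2, 3): 3, (1, 1): 1}
```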
Figure 6: Numbers of users that use the same template for different numbers of tables.
Figure 7: Two concrete examples of table templates: (a) a general table template; (b) a sport-specific table template.
Figure 8: Missed updates (unchanged / to be created / to be updated / to be deleted) in relation to snapshot date.
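The template-usage statistics behind Figures 6 and 7 can be approximated by grouping tables on (a hash of) the exact markup of their first version and counting tables and distinct users per group – the two axes of Figure 6. A rough sketch, with hypothetical (table_id, user, first_version_markup) input records:

```python
import hashlib
from collections import defaultdict

# Group tables whose first versions share the exact same markup, and count
# how many tables and how many distinct users each candidate template has
# (the two axes of Figure 6). Input records are hypothetical.
def template_usage(first_versions):
    by_markup = defaultdict(lambda: {"tables": 0, "users": set()})
    for _table_id, user, markup in first_versions:
        key = hashlib.sha1(markup.encode("utf-8")).hexdigest()
        entry = by_markup[key]
        entry["tables"] += 1
        entry["users"].add(user)
    return {k: (v["tables"], len(v["users"])) for k, v in by_markup.items()}

records = [
    ("t1", "alice", "{| ... dummy ... |}"),
    ("t2", "bob",   "{| ... dummy ... |}"),
    ("t3", "carol", "{| unique |}"),
]
usage = template_usage(records)
# Two distinct first versions: one used twice by two users, one used once.
print(sorted(usage.values()))  # [(1, 1), (2, 2)]
```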
4.3 How are tables created?

Creating tables is a tedious job, especially for inexperienced users, who might not be familiar with the syntax of tables. An obvious hypothesis is therefore that users copy & paste similar tables (created by other users or themselves) and adapt them according to their needs. To investigate this hypothesis, we studied the frequency with which the same content appears in the first version of different tables. For a more accurate picture, we also analyzed how many users chose to use exactly the same table markup code, presumably as table templates.

The first observation we made is that 3,004,883 of the 3,471,609 tables (86.6%) in our corpus appear to have unique first versions. This does not imply that they are not copied from somewhere else, but that they were modified prior to the first save of the page. On the other end of the spectrum, there are templates that are used more than 15,000 times to create tables.

As can be seen in Figure 6, the ratio of tables to the number of users that use the same template varies greatly. Some templates are used for thousands of tables, but also by thousands of different users (top right in the plot). This is usually the case for example tables that contain only dummy values (an example can be seen in Figure 7a). However, there are other templates that are also used for thousands of tables, but only by a few dozen users (top left in the plot). These are usually domain-specific templates, such as tables for sports results or statistics (see Figure 7b for an example).

5 EVOLUTION

The second phase in the lifetime of a table is its evolution. This phase encompasses all changes that happen between the initial creation and the possible final deletion, including changes to data, to schema, and to shape.

5.1 How often are tables changed?

The average table in our corpus is re-inserted 0.62 times, deleted 0.93 times, and updated 13.89 times. Of the 0.62 re-inserts, 0.10 are fresh table versions, i.e., the table's content is different from any previous version, which means 0.52 of the inserts restore previously existing table versions that were deleted at some prior point in time. For the 13.89 updates, the ratio of fresh to old versions is 11.97 fresh updates versus 1.92 updates that restore previously existing versions. While these average numbers seem quite low, there is a large skew: there is a table on social networking websites that was updated more than 10,000 times during its lifetime. At least 1,310 tables were each updated more than 1,000 times during their lifetimes.

Figure 8 shows the number of tables that would have been created/updated/deleted by the date of our analyzed snapshot (September 1, 2019) if the snapshot had been taken at a previous point in time (shown on the x-axis). In a one-month-old snapshot, 4.4% of tables are already outdated. If the snapshot had been taken a year earlier, 26.6% of the tables would no longer represent the current state. In a five-year range, this number rises to 60.6%.

The violin plot in Figure 9 shows how the change frequency behaves with increasing table age. The shape of a violin plot follows the distribution of the values: the wider the line, the more probable the value. Within the violin plot, there are marks at the 0.25/0.5/0.75 quantiles. In particular, the plot shows how the time since the last update is distributed for tables at different ages. The median rises until a certain point, after which it stays constant or slightly decreases again. However, the distribution is skewed towards the two ends of the spectrum: tables are either very frequently updated or hardly ever changed. For example, considering the quantiles for the 5-year-old tables, more than 25% of these tables were updated in the last year, while for another 25% the last update was almost more than four years ago.

Figure 9: Table freshness over time.
Figure 10: Creator update activity for tables created by registered users.
Figure 11: Number of editors per table.
Figure 12: A density plot of table position differences between two consecutive table versions.

5.2 Who changes tables?

Figure 10 shows how long the original creator of a table remains active as an updater. We distinguish between registered and unregistered creators, because for unregistered creators we only have the IP address as an identifier, which might change from time to time. Therefore, it is not too surprising that the share of edits made by an unregistered creator quickly drops, and hence we exclude tables created by unregistered users from this plot. For registered users, on the other hand, there are tables that are still updated by their original creators years after they have been created. In reality, the influence of the original author on a table could be even higher than this plot suggests: the number of edits for a table decreases over time, so the first buckets contain more edits.

When we look at the number of editors that change individual tables in Figure 11, we see that a large number of tables (35%) are updated only by the creator of the table. In this analysis, we only consider users as editors of a table if they create a new version (a simple revert to a previous version is not counted). While most of the tables are updated by only a few users, there are some exceptions where thousands of users contribute to a table. Again, the previously mentioned table on social networking websites holds the record, with contributions by 4,235 distinct users.

5.3 Are tables moved?

Figure 12 shows how much tables move in relation to other tables on their page. While for most page revisions the tables do not move or move only slightly, there are page revisions in which tables move by up to 1,574 positions on a single page (we removed this one extreme case as an outlier). We observe that if tables move, this is often due to the insertion or deletion of other tables, and that tables move down the page (64.09%) more often than up (35.91%). One obvious reason for this imbalance is the fact that a table inserted in the middle of a page causes other tables to move down, and insertions are more common than deletions. On average, a table's position changes 1.66 times during its lifespan.

5.4 How much do tables change?

Figure 13 shows how the content of tables develops over time. More precisely, it shows a similarity score of each table compared to the first version of that table (calculated on a random subset of 1,000 tables). We use a similarity metric that is based on a word-vector representation of both table versions: sim(v, w) = Σ_i min(v_i, w_i) / Σ_i max(v_i, w_i), where v and w are the word vectors of the two table versions to be compared. In general, the similarity is expected to decrease over time, but it can also rise if the table content becomes more similar to its original version again. While there are some tables that stay almost unchanged throughout their lifetime, there are other tables that change rapidly within the first few days of their existence. One reason for this could be that people copy & paste other tables as templates and then adjust the content, as explained in Section 4.3.

During their lifetime, 23.6% of all tables grow or shrink in the number of columns, and 37.0% grow or shrink in the number of rows. However, 57.3% of all tables retain their original size throughout their lifetime.

5.5 How much do schemata change?

About half of all tables never change their schema, as can be seen in Figure 14 (note that this is a log-log plot). On the other end, there are tables that change their schema hundreds of times, up to 443 changes. On average, each table has 1.86 schema versions. The types of schema change are manifold: for example, columns are renamed, added, or removed.

Figure 16 shows a vivid example of how the schemata of web tables evolve over time. To create this plot, we clustered schemata based on tables that evolve from one schema to another. This particular plot shows a cluster of schemata that all contain information about league results of football teams. There are almost 500 tables for which at least one of the snapshots had one of the Schemata 2–7. More than half of those tables followed Schema 6 at the beginning of 2011, while the other half mostly did not yet exist (Group 1). The splines show how this schema evolved into many different specializations until 2018. In some cases these specializations make sense (such as a clarification about the league system); in other cases they are due to inconsistent changes.
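The similarity metric of Section 5.4 is a generalized Jaccard similarity over word-count vectors. A minimal sketch; the paper does not specify its tokenization, so splitting on whitespace is an assumption made here.

```python
from collections import Counter

# Generalized Jaccard similarity between two table versions, following
# sim(v, w) = sum_i min(v_i, w_i) / sum_i max(v_i, w_i) from Section 5.4.
# Whitespace tokenization is an assumption; the paper does not specify it.
def table_similarity(version_a: str, version_b: str) -> float:
    v, w = Counter(version_a.split()), Counter(version_b.split())
    vocab = v.keys() | w.keys()
    num = sum(min(v[t], w[t]) for t in vocab)
    den = sum(max(v[t], w[t]) for t in vocab)
    return num / den if den else 1.0

a = "Year Division Position 2010 1"
b = "Season Division Position 2010 1"
print(round(table_similarity(a, b), 3))  # 4 shared tokens of 6 -> 0.667
```

Identical versions score 1.0 and fully disjoint versions score 0.0, which matches the bounded similarity axis of Figure 13.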
Figure 13: Similarity of table versions and the table's first version. Each line represents one table.
Figure 14: Number of schema versions per table.
Figure 15: Time until deletion (deleted vs. alive tables).

Figure 16: Example of schemata evolving over time. The schema clusters are:
1 NON-EXISTENT
2 Season, Division, [[Bavarian football league system|Tier]], Position
3 Season, Division, Tier, Position
4 Year, Division, [[Bavarian football league system|Tier]], Position
5 Year, Division, [[German football league system|Tier]], Position
6 Year, Division, Position
7 Year, Division, Position
8 OTHER-SCHEMA
One such inconsistent change is the header "Year", which after manual inspection should actually be "Season" (a range spanning two consecutive years, in most cases). As these tables are web tables, the header can also be formatted differently, and we can see that for most tables of Schema 7, the header was changed to bold type (Schema 6) between 2009 and 2010. Still, a small number of tables even after almost a decade did not make this transition.

6 DELETION

Figure 15 shows how long tables survive, counting the days from their creation. The blue part shows the percentage of tables that reached the respective age without being deleted. The green part represents those tables that were created long enough ago that they could have reached the respective age, but were deleted before reaching it. 69.5% of all tables ever created have survived until the end date of our dataset. If a table is deleted, this usually happens at the beginning of its lifetime: the longer a table exists, the less likely it becomes that it will be deleted.

From the time a table is created until it is deleted (or until the end date of the dataset), the average in our table corpus is 4.93 years. In 97.7% of that time, the table is truly part of the page, while in the remaining 40.50 days the table is (temporarily) deleted.

While the vast majority of tables are never deleted (57.2%) or deleted only once (29.9%), there is a large skew in the distribution of deletes. One table that explains the Wiki syntax was deleted 620 times during its lifetime, mostly due to vandalism.

7 CONCLUSIONS

In summary, we have seen how fast tables on Wikipedia change and how fast they come and go. When working with this corpus, it is important to keep this additional temporal dimension in mind and to leverage it when possible. The history also makes other dimensions of the corpus accessible, such as the creators, editors, or templates, which together provide a perspective on the tables that is more holistic than single snapshots of individual tables or of a table corpus. As future work, we plan to explore whether other structured corpora, such as Wikipedia infoboxes or lists, for which we also provide histories, behave similarly in terms of their dynamics. Furthermore, we want to use the gained insights to assign trust to values and improve data quality. We encourage researchers to explore their datasets in a similar manner to uncover hidden information in a dataset's history.
REFERENCES
[1] B. Thomas Adler, Krishnendu Chatterjee, Luca De Alfaro, Marco Faella, Ian Pye, and Vishwanath Raman. 2008. Assigning trust to Wikipedia content. In Proceedings of the International Symposium on Wikis (WikiSym). 26:1–26:12.
[2] Enrique Alfonseca, Guillermo Garrido, Jean-Yves Delort, and Anselmo Peñas. 2013. WHAD: Wikipedia historical attributes data – Historical structured data extraction and vandalism detection from the Wikipedia edit history. Language Resources and Evaluation 47, 4 (2013), 1163–1190.
[3] Tobias Bleifuß, Leon Bornemann, Theodore Johnson, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. 2018. Exploring Change – A New Dimension of Data Analytics. PVLDB 12, 2 (2018), 85–98.
[4] Tobias Bleifuß, Leon Bornemann, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. 2021. Structured Object Matching across Web Page Revisions. In Proceedings of the International Conference on Data Engineering (ICDE).
[5] Leon Bornemann, Tobias Bleifuß, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. 2020. Natural Key Discovery in Wikipedia Tables. In Proceedings of The Web Conference 2020. 2789–2795.
[6] Andrei Z. Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet L. Wiener. 2000. Graph structure in the Web. Comput. Networks 33, 1-6 (2000), 309–320.
[7] A.D. Broido and A. Clauset. 2019. Scale-free networks are rare. Nature Communications 10, 1017 (2019).
[8] Luciana S. Buriol, Carlos Castillo, Debora Donato, Stefano Leonardi, and Stefano Millozzi. 2006. Temporal analysis of the wikigraph. In International Conference on Web Intelligence (WI). 45–51.
[9] Siarhei Bykau, Flip Korn, Divesh Srivastava, and Yannis Velegrakis. 2015. Fine-grained controversy detection in Wikipedia. In Proceedings of the International Conference on Data Engineering (ICDE). 1573–1584.
[10] Michael Cafarella, Alon Halevy, Hongrae Lee, Jayant Madhavan, Cong Yu, Daisy Zhe Wang, and Eugene Wu. 2018. Ten years of webtables. PVLDB 11, 12 (2018), 2140–2149.
[11] Andrea Ceroni, Mihai Georgescu, Ujwal Gadiraju, Kaweh Djafari Naini, and Marco Fisichella. 2014. Information evolution in Wikipedia. In Proceedings of the International Symposium on Open Collaboration (OpenSym). 24:1–24:10.
[12] Julian Eberius, Maik Thiele, Katrin Braunschweig, and Wolfgang Lehner. 2015. Top-k Entity Augmentation Using Consistent Set Covering. In Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM). 8:1–8:12.
[13] Mihai Georgescu, Nattiya Kanhabua, Daniel Krause, Wolfgang Nejdl, and Stefan Siersdorfer. 2013. Extracting event-related information from article updates in Wikipedia. In Advances in Information Retrieval (ECIR). Springer, 254–266.
[14] Braden Hancock, Hongrae Lee, and Cong Yu. 2019. Generating Titles for Web Tables. In Proceedings of the International World Wide Web Conference (WWW). ACM, 638–647.
[15] Aniket Kittur, Bongwon Suh, Bryan A. Pendleton, and Ed H. Chi. 2007. He says, she says: conflict and coordination in Wikipedia. In Proceedings of the International Conference on Human Factors in Computing Systems (SIGCHI). 453–462.
[16] Flip Korn, Xuezhi Wang, You Wu, and Cong Yu. 2019. Automatically Generating Interesting Facts from Wikipedia Tables. In Proceedings of the International Conference on Management of Data (SIGMOD). 349–361.
[17] Larissa R. Lautert, Marcelo M. Scheidt, and Carina F. Dorneles. 2013. Web Table Taxonomy and Formalization. SIGMOD Record 42, 3 (2013), 28–33.
[18] Oliver Lehmberg, Dominique Ritze, Robert Meusel, and Christian Bizer. 2016. A large public corpus of web tables containing time and context metadata. In Proceedings of the International Conference Companion on World Wide Web. 75–76.
[19] Pei Li, Xin Luna Dong, Andrea Maurino, and Divesh Srivastava. 2011. Linking temporal records. PVLDB 4, 11 (2011), 956–967.
[20] Mostafa Mesgari, Chitu Okoli, Mohamad Mehdi, Finn Årup Nielsen, and Arto Lanamäki. 2015. The sum of all human knowledge: A systematic review of scholarly research on the content of Wikipedia. Journal of the Association for Information Science and Technology 66, 2 (2015), 219–245.
[21] Rakesh Pimplikar and Sunita Sarawagi. 2012. Answering Table Queries on the Web Using Column Keywords. PVLDB 5, 10 (2012), 908–919.
[22] Shazia Wasim Sadiq, Tamraparni Dasu, Xin Luna Dong, Juliana Freire, Ihab F. Ilyas, Sebastian Link, Renée J. Miller, Felix Naumann, Xiaofang Zhou, and Divesh Srivastava. 2017. Data Quality: The Role of Empiricism. SIGMOD Record 46, 4 (2017), 35–43.
[23] Paolo Sottovia, Matteo Paganelli, Francesco Guerra, and Yannis Velegrakis. 2019. Finding Synonymous Attributes in Evolving Wikipedia Infoboxes. In Advances in Databases and Information Systems (ADBIS). 169–185.
[24] Jingjing Wang, Haixun Wang, Zhongyuan Wang, and Kenny Q. Zhu. 2012. Understanding tables on the web. In Proceedings of the International Conference on Conceptual Modeling (ER). Springer, 141–155.
[25] Ziqi Zhang. 2017. Effective and Efficient Semantic Table Interpretation using TableMiner+. Semantic Web 8, 6 (2017), 921–957.