The Secret Life of Wikipedia Tables

Tobias Bleifuß, Hasso Plattner Institute, University of Potsdam, Germany, tobias.bleifuss@hpi.de
Leon Bornemann, Hasso Plattner Institute, University of Potsdam, Germany, leon.bornemann@hpi.de
Dmitri V. Kalashnikov∗, dmitri.vk@acm.org
Felix Naumann, Hasso Plattner Institute, University of Potsdam, Germany, felix.naumann@hpi.de
Divesh Srivastava, AT&T Chief Data Office, United States, divesh@att.com

∗ Work done while at AT&T Research.

ABSTRACT
Tables on the web, such as those on Wikipedia, are not the static grid of values that they seem to be. Rather, they have a life of their own: they are created under certain circumstances and in certain webpage locations, they change their shape, they move, they grow, they shrink, their data changes, they vanish, and they re-appear. When users look at web tables, or when scientists extract data from them, they are most likely not aware that behind each table lies a rich history.

For this empirical paper, we have extracted, matched, and analyzed the entire history of all 3.5 M tables on the English Wikipedia, for a total of 53.8 M table versions. Based on this enormous dataset of public table histories, we provide various analysis results, such as statistics about lineage sizes, table positions, volatility, change intervals, schema changes, and their editors. Apart from satisfying curiosity, analyzing and understanding the change behavior of web tables serves various use cases, such as identifying out-of-date values, recognizing systematic changes across tables, and discovering change dependencies.

Figure 1: An example of an evolving table in Wikipedia (from https://en.wikipedia.org/?diff=prev&oldid=541341520).

Reference Format:
Tobias Bleifuß, Leon Bornemann, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. The Secret Life of Wikipedia Tables. In the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores (SEA Data 2021).

1 EMPIRICAL RESEARCH ON THE WEB
Traditionally, empiricism plays a minor role in the theory- and engineering-oriented field of our research community, while it has played a significant role in other disciplines of computer science (e.g., [6, 7]). Hardly ever do we pause to analyze and reflect on the observable, "natural" behavior of data and systems. Among the notable exceptions is the area of data quality research [22].

One example of observable behavior is that of web tables, in particular those that are collaboratively edited. As such, an example of a very heterogeneous and semi-structured data lake is the set of tables on Wikipedia. Such web tables are used for a variety of purposes, as was recently surveyed in [10], including entity extraction and fact generation [16], improving web search [21], and entity linking [19]. Other work seeks to enhance web tables themselves, such as generating their titles [14], generating column headers [24], or finding subject columns [5, 25]. Again, all of these approaches make use of table content, headers, and surrounding text and data. Providing more such data, and in particular different versions of such data, gives these machine learning approaches a richer input set.

In the context of our Janus project [3], we have been extracting and working with the histories of various structured datasets, including DBLP, IMDB, open government data, and in particular Wikipedia, for which a detailed history of every edit is available. In this empirical paper, we focus on tables as they appear on Wikipedia pages and report on our various observations across their lifetime, including their creation (Section 4), their evolution over time (Section 5), and ultimately their deletion (Section 6). We report on such varied dimensions as table counts, users, duration, edits, table similarity, table position, and of course time itself, highlighting expected and some surprising behavior. Figure 1 shows
one exemplary evolutionary step (in schema and data) for one of millions of Wikipedia tables.

Copyright © 2021 for the individual papers by the papers' authors. Copyright © 2021 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0). Published in the Proceedings of the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores, co-located with VLDB 2021 (August 16–20, 2021, Copenhagen, Denmark) on CEUR-WS.org.

Our results can help researchers better understand the volatility of web tables: a given table or corpus snapshot is not a stable basis but rather just that: a snapshot with a history of changes leading up to it and a future with many further changes. In fact, at the time of writing any given Wikipedia table was changed twice in the past year, on average, but with a standard deviation of 9.1, and some tables changed multiple times per day. But not only does the content of tables change; their schema also evolves over time. This information about evolving schemata can serve, for example, to identify synonymous attributes [23].

In the following, we highlight selected analyses in this paper and outline their possible implications for researchers:

Need for timeliness. Figure 8 shows how quickly a snapshot becomes outdated. As a consequence, all models that are trained on static snapshots run the risk of quickly becoming obsolete. Efficient methods for updating them are therefore desirable.

Help with maintenance and updating. A large portion of tables is created and maintained by power-users, as can be seen in Figures 5 and 10. Knowledge about update patterns could be used to notify these editors about (potentially) outdated values and the need for updates.

Suggestions for cleaning and improvement. Figure 16 illustrates an example of the inconsistencies that can emerge in tables. Based on the knowledge of how similar tables evolve, one can make concrete suggestions to improve their (in this case) schemata, such as to rename or add certain columns.

2 RELATED WORK
We discuss two types of related work: structured datasets with history and Wikipedia change analysis. There is a lot of research on web tables, so we can only provide a high-level overview in this short paper and refer to surveys and research papers for more details.

Related corpora. Wikipedia provides access to its entire version history, allowing us to track very fine-grained changes. A variety of datasets that also deal with (semi-)structured content have been extracted from the web and Wikipedia before. Multiple corpora of web tables [12, 18] provide extracts of static versions of tables on the web and have since been subject to extensive research [10]. For example, Lautert et al. establish a taxonomy of web tables and thus give an insight into the general structure of static web tables in [17], whereas we focus on the temporal evolution of web tables. The infobox history dataset WHAD [2] comprises structured information on Wikipedia, namely the changes of infoboxes. This dataset is orthogonal to our dataset, since it does not cover general tables, which are more diverse and also more complex in comparison.

Analyzing changes in Wikipedia. The content of Wikipedia has been the subject of much research [20]. While large parts of that research were conducted on static snapshots of Wikipedia, a variety of works analyzes changes on Wikipedia. Both the evolution of content [11] and the evolution of the page link graph [8] have been studied. Specifically, the study of content evolution can help detect conflicts [15] or controversy that may result in edit-wars [9]. The edit histories serve as input to event extraction [13] and are also valuable for trust assignment [1]. Our approach can provide a better understanding of what has really changed, from which many of these studies should benefit.

3 TABLE CORPUS
To explore tables over time, we need to be able to track them over time, which is a non-trivial task as tables and their context can change. We consider tables as objects with an identity, which, in contrast to their concrete shape and content, stays constant over time. A table can have multiple versions, where each version is an edit of the previous version. However, tables on the web usually lack a stable identifier, which is why we have to infer that identifier. We proposed a solution for this identity inference through a table matching procedure. The details of this work are described in [4], where we also evaluate the matching and show that it works much better than related work. Our input for the table change extraction process is a dump of web page versions – either snapshots that have been crawled or, specifically for Wikipedia, the complete edit history as a set of XML files¹. These XML files contain the actual version content encoded as Wikitext, a markup language including table markup, as well as additional metadata for each of the revisions, such as its timestamp, author, and comments.

¹ https://dumps.wikimedia.org/

For every page version, we extract a (possibly empty) list of parsed table nodes. This list of table nodes for every page revision constitutes the input for our table matching. For each page revision, it is necessary to decide whether the tables therein are versions of previously identified tables or entirely new tables. It is not sufficient to consider only tables of the directly preceding version, because tables can be deleted and restored several revisions (and sometimes several years) later. To determine the quality of our matching, we have manually created a gold standard of table matchings comprising 1,445 tables, with a total of 16,919 distinct table versions selected from 90 pages. We show that the matching works well, matching all versions that belong to a given table correctly for 88.58% of all tables in the gold standard. For a majority of the remaining tables, we misclassify only a small number of versions of individual table histories (1 mistake: 6.09%; 2 mistakes: 2.98%). Our matching decisions for individual table versions reach >99% F1-measure.

We next present various statistics based on the 3,471,609 Wikipedia table objects we have collected and linked using our matching process. These statistics are based on the Wikidump of September 1, 2019. Both our gold standard and output dataset are available at our project website www.IANVS.org. The different statistics and findings are grouped by the three phases of a table's existence: creation, evolution, and deletion. Already these three phases of existence show that establishing a table identity is essential for the following statistics, because without it, it would not be possible to determine statistics that aggregate on a per-table basis (but only on a per-version basis).

4 CREATION
In this section, we focus on the first insert of every table – its creation – even when during its lifetime a table might be deleted and recreated multiple times.

4.1 Where and when are tables created?
Figure 2 shows that Wikipedia pages in their initial years (2001–2003) had almost no tables. Using tables in Wikipedia became more popular only around 2004, and tables were fully adopted by the end of 2006. Since then, every month around 20,000 new tables are created (about one every two minutes). The hypothesis that insertion frequency would decrease once tables are inserted at all relevant locations seems false: While the number of new pages created per month has been dropping since 2007², the insertion rate of new tables remains constant. This relative increase in tables per page shows that more and more data is stored in a structured fashion, raising the relevance of methods to extract knowledge from said tables.

² Source: https://stats.wikimedia.org/#/en.wikipedia.org/contributing/new-pages/normal|line|2001-01-01~2020-10-01|page_type~content|monthly

Figure 2: Number of tables and pages created per month.
Figure 3: Histogram of the maximum table count per page.
Figure 4: Correlation of years mentioned in categories and time of creation.

We observed separately that most tables are created at the same time as, or soon after, the page containing them. Only for pages that were created at the beginning of Wikipedia, when tables were not so popular, are larger gaps between the page creation and the creation of the first table on the page common.

Figure 3 shows a histogram of the maximum number of tables that ever existed simultaneously on a Wikipedia article. The vast majority of Wikipedia articles contain only a few tables (we omitted the even larger number of pages that do not contain any tables at all). On the other hand, most tables appear on pages together with other tables. Only 19.1% of all tables appear alone on a Wikipedia article. The many tables that exist in the vicinity of each other can be assumed to be related in terms of content.

On Wikipedia, every article can link to categories, which are used to group related articles to a topic and can themselves be organized in categories. We investigate how the creation dates of tables correlate with any year mentioned in these page categories (such as 2020 for "2020 United States presidential election"), which we assume to be the relevant years for that table. Figure 4 shows that the extracted years and the creation year match for most tables. For every mention of a year in the page categories, a table is counted in a cell that represents the month of creation (on the x-axis) and the mentioned year (y-axis). If those two dimensions aligned perfectly, we would only see marks close to the diagonal of the plot. There is a tendency for tables to be created in the second half of the mentioned year or in the beginning of the following year, which shows as a small shift to the right in the plot. For those years that are covered by our dataset (2004–2019), in 50.8% of all cases the mentioned year and the year of creation align, in 5.6% of the cases the tables are created before the mentioned year, and in 43.5% of the cases the tables are created after the mentioned year.

4.2 Who creates tables?
The distribution of the number of tables that a user creates is shown in Figure 5. Only 13.1% of the tables are created by non-registered users. The figure also clearly shows that tables are more likely to be created by power-users: More than half of the tables are created by users who each have also created at least 128 other tables. The record for the highest number of tables created by a single user is 20,194 (in this case on a variety of sports events).

Figure 5: Histogram of tables bucketed by the total number of tables an author created.

A possible explanation for this behavior could be that the effort and skill it takes to create a new table is too high for many users. On the other hand, there are very dedicated users who must have acquired the necessary skills and possibly tools to create thousands of tables. One insight that we can take from this observation is that any random sample of tables is likely to be influenced by those power users.

4.3 How are tables created?
Creating tables is a tedious job, especially for inexperienced users, who might not be familiar with the syntax of tables.
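To make that syntax concrete: in Wikitext, a table opens with a line starting with "{|" and closes with a line starting with "|}". The following Python fragment is a minimal, illustrative sketch of how such table blocks can be located in a page version; it is not the parser used in our pipeline (which extracts fully parsed table nodes, see [4]), and the function name and the sample page are ours for illustration. Since tables can be nested, the sketch tracks the nesting depth and collects only top-level blocks.

```python
import re

def extract_table_blocks(wikitext):
    """Collect top-level Wikitext table blocks.

    Tables open with a line starting '{|' and close with a line
    starting '|}'. Nested tables are possible, so we track the
    nesting depth while scanning. Simplified sketch: real pages may
    hide these delimiters inside templates or HTML comments.
    """
    tables, depth, start = [], 0, None
    for m in re.finditer(r"^(\{\||\|\})", wikitext, re.MULTILINE):
        if m.group() == "{|":
            if depth == 0:
                start = m.start()   # remember where the outer table begins
            depth += 1
        elif depth > 0:
            depth -= 1
            if depth == 0:
                tables.append(wikitext[start:m.end()])
    return tables

# A hypothetical page version with one table in typical Wikitext markup:
page_version = """Some article text.
{| class="wikitable"
! Year !! Division !! Position
|-
| 2004 || Bezirksliga || 1
|}
More article text."""

print(len(extract_table_blocks(page_version)))  # -> 1
```

A real pipeline would run this per revision of the page history and feed the resulting table lists into the matching step described in Section 3.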
An obvious hypothesis is therefore that users copy & paste similar tables (created by other users or themselves) and adapt them according to their needs. To investigate this hypothesis, we studied the frequency with which the same content appears in the first version of different tables. For a more accurate picture, we also analyzed how many users chose to use exactly the same table markup code, presumably as table templates.

The first observation we made is that 3,004,883 of the 3,471,609 tables (86.6%) in our corpus appear to have unique first versions. This does not imply that they were not copied from somewhere else; they may simply have been modified prior to the first save of the page. On the other end of the spectrum, there are templates that are used more than 15,000 times to create tables.

As can be seen in Figure 6, the ratio of tables and number of users that use the same template varies greatly. Some templates are used for thousands of tables, but also by thousands of different users (top right in the plot). This is usually the case for example tables that contain only dummy values (an example can be seen in Figure 7a). However, there are other templates that are also used for thousands of tables, but only by a few dozen users (top left in the plot). These are usually domain-specific templates, such as tables for sports results or statistics (see Figure 7b for an example).

Figure 6: Numbers of users that use the same template for different numbers of tables.
Figure 7: Two concrete examples of table templates: (a) a general table template, (b) a sport-specific table template.
Figure 8: Missed updates in relation to snapshot date.

5 EVOLUTION
The second phase in the lifetime of a table is its evolution. This phase encompasses all changes that happen between the initial creation and the possible final deletion, including changes to data, to schema, and to shape.

5.1 How often are tables changed?
The average table in our corpus is re-inserted 0.62 times, deleted 0.93 times, and updated 13.89 times. Of the 0.62 re-inserts, 0.10 are fresh table versions, i.e., the table's content is different from any previous version, which means 0.52 of the inserts restore previously existing table versions that were deleted at some prior point in time. For the 13.89 updates, the ratio of fresh to old versions is 11.97 fresh updates versus 1.92 updates that restore previously existing versions. While these average numbers seem quite low, there is a large skew: there is a table on social networking websites that was updated more than 10,000 times during its lifetime. At least 1,310 tables were each updated more than 1,000 times during their lifetimes.

Figure 8 shows the number of tables that would have been created/updated/deleted by the date of our analyzed snapshot (September 1, 2019) if the snapshot had been taken at a previous point in time (shown on the x-axis). In a one-month-old snapshot, already 4.4% of tables are outdated. If the snapshot had been taken a year earlier, 26.6% of the tables would no longer represent the current state. Over a five-year range, this number rises to 60.6%.

The violin plot in Figure 9 shows how the change frequency behaves with increasing table age. The shape of a violin plot follows the distribution of the values: the wider the line, the more probable the value. Within the violin plot, there are marks at the 0.25/0.5/0.75 quantiles. In particular, it shows how the time since the last update is distributed for tables at different ages. The median rises until a certain point, after which it stays constant or slightly decreases again. However, the distribution is skewed towards the two ends of the spectrum: tables either are very frequently updated or are hardly ever changed. For example, considering the quantiles for the 5-year-old tables, more than 25% of these tables were updated in the last year, and for another 25% the last update was more than four years ago.

Figure 9: Table freshness over time.

5.2 Who changes tables?
Figure 10 shows how long the original creator remains active as an updater of a table. We distinguish between registered and unregistered creators, because for unregistered creators we have only the IP address as an identifier, which might change from time to time. Therefore, it is not too surprising that the share of edits done by an unregistered creator quickly drops, and hence we exclude tables created by unregistered users from this plot. For registered users, on the other hand, there are tables that are still updated by the original creators years after they have been created. In reality, the influence of the original author on a table could be even higher than what this plot suggests: the number of edits for a table decreases over time, so the first buckets contain more edits.

When we look at the number of editors that change individual tables in Figure 11, we see that a large number of tables (35%) are updated by the creator of the table only. In this analysis, we only consider users as editors of a table who create a new version (a simple revert to a previous version is not counted). While most of the tables are updated by only a few users, there are some exceptions where thousands of users contribute to the table. Again, the previously mentioned table on social networking websites holds the record, with contributions by 4,235 distinct users.

Figure 10: Creator update activity for tables created by registered users.
Figure 11: Number of editors per table.

5.3 Are tables moved?
Figure 12 shows how much tables move in relation to other tables on their page. While for most page revisions the tables do not move or move only slightly, there are page revisions for which tables move by up to 1,574 positions on a single page (we removed this one extreme case as an outlier). We observe that if tables move, this is often due to the insertion or deletion of tables, and that tables move down on the page (64.09%) more often than up (35.91%). One obvious reason for this imbalance is the fact that a table inserted in the middle of the page causes other tables to move down, and these insertions are more common than deletions. On average, a table's position changes 1.66 times during its lifespan.

Figure 12: A density plot of table position differences between two consecutive table versions.

5.4 How much do tables change?
Figure 13 shows how the content of tables develops over time. More precisely, it shows a similarity score of each table compared to the first version of that table (calculated on a random subset of 1,000 tables). We use a similarity metric that is based on a word vector representation of both table versions: $\mathrm{sim}(\vec{v}, \vec{w}) = \frac{\sum_i \min(v_i, w_i)}{\sum_i \max(v_i, w_i)}$, where $\vec{v}$ and $\vec{w}$ are the word vectors of the two table versions to be compared. In general, the similarity is expected to decrease over time, but it can also rise if the table content becomes more similar to its original version. While there are some tables that stay almost unchanged throughout their lifetime, there are other tables that rapidly change within the first few days of their existence. One reason for this could be that people copy & paste other tables as templates and then adjust the content, as explained in Section 4.3.

During their lifetime, 23.6% of all tables grow or shrink in the number of columns, and 37.0% grow or shrink in the number of rows. However, 57.3% of all tables retain their original size throughout their lifetime.

Figure 13: Similarity of table versions and the table's first version. Each line represents one table.

5.5 How much do schemata change?
About half of all tables never change their schema, as can be seen in Figure 14 (note that this is a log-log plot). At the other end, there are tables that change their schema hundreds of times, up to 443 changes. On average, each table has 1.86 schema versions. The types of schema change can be manifold: for example, columns are renamed, added, or removed.

Figure 14: Number of schema versions per table.

Figure 16 shows a vivid example of how schemata of web tables evolve over time. To create this plot, we created a clustering of schemata based on tables that evolve from one schema to another. This particular plot shows a cluster of schemata that all contain information about league results of football teams. There are almost 500 tables for which at least one of the snapshots had one of the Schemata 2–7. More than half of those tables followed Schema 6 at the beginning of 2011, while the other half mostly did not yet exist (Group 1). The splines show how this schema evolved into many different specializations until 2018. While in some cases these specializations make sense (such as a clarification about the league system), in other cases they are due to inconsistent changes (such as the header "Year", which after manual inspection should actually be "Season", a range spanning two consecutive years, in most cases). As these tables are web tables, the header can also be formatted differently, and we can see that for most tables of Schema 7, the header was changed to bold type (Schema 6) between 2009 and 2010. Still, a small number of tables even after almost a decade did not make this transition.

Figure 16: Example of schemata evolving over time (2008–2018). Schemata: 1 NON-EXISTENT; 2 Season, Division, [[Bavarian football league system|Tier]], Position; 3 Season, Division, Tier, Position; 4 Year, Division, [[Bavarian football league system|Tier]], Position; 5 Year, Division, [[German football league system|Tier]], Position; 6 Year, Division, Position; 7 Year, Division, Position; 8 OTHER-SCHEMA.

6 DELETION
Figure 15 shows how long tables survive, counting the days from their creation. The blue part shows the percentage of tables that reached the respective age without being deleted. The green part represents those tables that were created long enough ago that they could have reached the respective age, but were deleted before reaching it. 69.5% of all tables ever created have survived until the end-date of our dataset. If a table is deleted, then this usually happens at the beginning of its lifetime. The longer a table exists, the less likely it becomes that it will be deleted.

From the time a table is created until it is deleted (or until the end-date of the dataset), the average lifespan in our table corpus is 4.93 years. For 97.7% of that time, the table is truly part of the page, while for the remaining 40.50 days the table is (temporarily) deleted.

While the vast majority of tables is never deleted (57.2%) or deleted only once (29.9%), there is a larger skew in the distribution of deletes. One table that explains the Wiki syntax was deleted 620 times during its lifetime, mostly due to vandalism.

Figure 15: Time until deletion.

7 CONCLUSIONS
In summary, we have seen how fast tables on Wikipedia change and how fast they come and go. When working with this corpus, it is important to keep this additional temporal dimension in mind and leverage it when possible. The history also makes other dimensions of the corpus accessible, such as the creators, editors, or templates, which together provide a perspective on the tables that is more holistic than single snapshots of individual tables or a table corpus. As future work, we plan to explore whether other structured corpora, such as Wikipedia infoboxes or lists, for which we also provide histories, behave similarly in terms of their dynamics. Furthermore, we want to use the gained insights to assign trust to values and improve data quality. We encourage researchers to explore their datasets in a similar manner to uncover hidden information in a dataset's history.

REFERENCES
[1] B. Thomas Adler, Krishnendu Chatterjee, Luca de Alfaro, Marco Faella, Ian Pye, and Vishwanath Raman. 2008. Assigning trust to Wikipedia content. In Proceedings of the International Symposium on Wikis (WikiSym). 26:1–26:12.
[2] Enrique Alfonseca, Guillermo Garrido, Jean-Yves Delort, and Anselmo Peñas. 2013. WHAD: Wikipedia historical attributes data – Historical structured data extraction and vandalism detection from the Wikipedia edit history. Language Resources and Evaluation 47, 4 (2013), 1163–1190.
[3] Tobias Bleifuß, Leon Bornemann, Theodore Johnson, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. 2018. Exploring Change – A New Dimension of Data Analytics. PVLDB 12, 2 (2018), 85–98.
[4] Tobias Bleifuß, Leon Bornemann, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. 2021. Structured Object Matching across Web Page Revisions. In Proceedings of the International Conference on Data Engineering (ICDE).
[5] Leon Bornemann, Tobias Bleifuß, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. 2020. Natural Key Discovery in Wikipedia Tables. In Proceedings of The Web Conference 2020. 2789–2795.
[6] Andrei Z. Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet L. Wiener. 2000. Graph structure in the Web. Computer Networks 33, 1-6 (2000), 309–320.
[7] A. D. Broido and A. Clauset. 2019. Scale-free networks are rare. Nature Communications 10, 1017 (2019).
[8] Luciana S. Buriol, Carlos Castillo, Debora Donato, Stefano Leonardi, and Stefano Millozzi. 2006. Temporal analysis of the wikigraph. In International Conference on Web Intelligence (WI). 45–51.
[9] Siarhei Bykau, Flip Korn, Divesh Srivastava, and Yannis Velegrakis. 2015. Fine-grained controversy detection in Wikipedia. In Proceedings of the International Conference on Data Engineering (ICDE). 1573–1584.
[10] Michael Cafarella, Alon Halevy, Hongrae Lee, Jayant Madhavan, Cong Yu, Daisy Zhe Wang, and Eugene Wu. 2018. Ten years of webtables. PVLDB 11, 12 (2018), 2140–2149.
[11] Andrea Ceroni, Mihai Georgescu, Ujwal Gadiraju, Kaweh Djafari Naini, and Marco Fisichella. 2014. Information evolution in Wikipedia. In Proceedings of the International Symposium on Open Collaboration (OpenSym). 24:1–24:10.
[12] Julian Eberius, Maik Thiele, Katrin Braunschweig, and Wolfgang Lehner. 2015. Top-k Entity Augmentation Using Consistent Set Covering. In Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM). 8:1–8:12.
[13] Mihai Georgescu, Nattiya Kanhabua, Daniel Krause, Wolfgang Nejdl, and Stefan Siersdorfer. 2013. Extracting event-related information from article updates in Wikipedia. In Advances in Information Retrieval (ECIR). Springer, 254–266.
[14] Braden Hancock, Hongrae Lee, and Cong Yu. 2019. Generating Titles for Web Tables. In Proceedings of the International World Wide Web Conference (WWW). ACM, 638–647.
[15] Aniket Kittur, Bongwon Suh, Bryan A. Pendleton, and Ed H. Chi. 2007. He says, she says: conflict and coordination in Wikipedia. In Proceedings of the International Conference on Human Factors in Computing Systems (SIGCHI). 453–462.
[16] Flip Korn, Xuezhi Wang, You Wu, and Cong Yu. 2019. Automatically Generating Interesting Facts from Wikipedia Tables. In Proceedings of the International Conference on Management of Data (SIGMOD). 349–361.
[17] Larissa R. Lautert, Marcelo M. Scheidt, and Carina F. Dorneles. 2013. Web Table Taxonomy and Formalization. SIGMOD Record 42, 3 (2013), 28–33.
[18] Oliver Lehmberg, Dominique Ritze, Robert Meusel, and Christian Bizer. 2016. A large public corpus of web tables containing time and context metadata. In Proceedings of the International Conference Companion on World Wide Web. 75–76.
[19] Pei Li, Xin Luna Dong, Andrea Maurino, and Divesh Srivastava. 2011. Linking temporal records. PVLDB 4, 11 (2011), 956–967.
[20] Mostafa Mesgari, Chitu Okoli, Mohamad Mehdi, Finn Årup Nielsen, and Arto Lanamäki. 2015. The sum of all human knowledge: A systematic review of scholarly research on the content of Wikipedia. Journal of the Association for Information Science and Technology 66, 2 (2015), 219–245.
[21] Rakesh Pimplikar and Sunita Sarawagi. 2012. Answering Table Queries on the Web Using Column Keywords. PVLDB 5, 10 (2012), 908–919.
[22] Shazia Wasim Sadiq, Tamraparni Dasu, Xin Luna Dong, Juliana Freire, Ihab F. Ilyas, Sebastian Link, Renée J. Miller, Felix Naumann, Xiaofang Zhou, and Divesh Srivastava. 2017. Data Quality: The Role of Empiricism. SIGMOD Record 46, 4 (2017), 35–43.
[23] Paolo Sottovia, Matteo Paganelli, Francesco Guerra, and Yannis Velegrakis. 2019. Finding Synonymous Attributes in Evolving Wikipedia Infoboxes. In Advances in Databases and Information Systems (ADBIS). 169–185.
[24] Jingjing Wang, Haixun Wang, Zhongyuan Wang, and Kenny Q. Zhu. 2012. Understanding tables on the web. In Proceedings of the International Conference on Conceptual Modeling (ER). Springer, 141–155.
[25] Ziqi Zhang. 2017. Effective and Efficient Semantic Table Interpretation using TableMiner+. Semantic Web 8, 6 (2017), 921–957.