<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Secret Life of Wikipedia Tables</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tobias Bleifuß</string-name>
          <email>tobias.bleifuss@hpi.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leon Bornemann</string-name>
          <email>leon.bornemann@hpi.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felix Naumann</string-name>
          <email>felix.naumann@hpi.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Divesh Srivastava</string-name>
          <email>divesh@att.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Reference Format:</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AT&amp;T Chief Data Office</institution>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Hasso Plattner Institute, University of Potsdam</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Tobias Bleifuß, Leon Bornemann, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. The Secret Life of Wikipedia Tables. In the 2nd</institution>
          ,
          <addr-line>Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores (SEA Data 2021).</addr-line>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>8</lpage>
      <abstract>
        <p>Tables on the web, such as those on Wikipedia, are not the static grid of values that they seem to be. Rather, they have a life of their own: they are created under certain circumstances and in certain webpage locations, they change their shape, they move, they grow, they shrink, their data changes, they vanish, and they re-appear. When users look at web tables or when scientists extract data from them, they are most likely not aware that behind each table lies a rich history. For this empirical paper, we have extracted, matched and analyzed the entire history of all 3.5 M tables on the English Wikipedia for a total of 53.8 M table versions. Based on this enormous dataset of public table histories, we provide various analysis results, such as statistics about lineage sizes, table positions, volatility, change intervals, schema changes, and their editors. Apart from satisfying curiosity, analyzing and understanding the change-behavior of web tables serves various use cases, such as identifying out-of-date values, recognizing systematic changes across tables, and discovering change dependencies.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>EMPIRICAL RESEARCH ON THE WEB</title>
      <p>
        Traditionally, empiricism plays a minor role in the theory- and
engineering-oriented field of our research community, while it has
played a significant role in other disciplines of computer science
(e.g., [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]). Hardly ever do we pause to analyze and reflect on the
observable, “natural” behavior of data and systems. Among the
notable exceptions is the area of data quality research [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>One example of observable behavior is that of web tables, in
particular those that are collaboratively edited. The set of tables
on Wikipedia is an example of such a very heterogeneous and
semi-structured data lake. Such web tables are used for a variety
of purposes, as was recently surveyed in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], including entity
extraction and fact generation [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], improving web search [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ],
and entity linking [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Other work seeks to enhance web tables
themselves, such as generating their title [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], generating column
headers [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], or finding subject columns [
        <xref ref-type="bibr" rid="ref25 ref5">5, 25</xref>
        ]. All of these
approaches make use of table content, headers, and surrounding
text and data. Providing more such data, and in particular different
versions of such data, gives these machine learning approaches a
richer input set.
      </p>
      <p>∗Work done while at AT&amp;T Research.</p>
      <p>
        Copyright © 2021 for the individual papers by the papers’ authors. Copyright © 2021
for the volume as a collection by its editors. This volume and its papers are published
under the Creative Commons License Attribution 4.0 International (CC BY 4.0).
Published in the Proceedings of the 2nd Workshop on Search, Exploration, and
Analysis in Heterogeneous Datastores, co-located with VLDB 2021 (August 16-20, 2021,
Copenhagen, Denmark) on CEUR-WS.org.
      </p>
      <p>
        In the context of our Janus project [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], we have been
extracting and working with the histories of various structured datasets,
including DBLP, IMDB, open government data, and in particular
Wikipedia, for which a detailed history of every edit is available.
In this empirical paper, we focus on tables as they appear on
Wikipedia pages and report on our various observations across their
lifetime, including their creation (Section 4), their evolution over
time (Section 5), and ultimately their deletion (Section 6). We
report on such varied dimensions as table-counts, users, duration,
edits, table similarity, table position, and of course time itself,
highlighting expected and some surprising behavior. Figure 1 shows
one exemplary evolutionary step (in schema and data) for one of
millions of Wikipedia tables.
      </p>
      <p>
        Our results can help researchers better understand the volatility
of web tables: a given table or corpus snapshot is not a stable basis
but rather just that: a snapshot with a history of changes leading
up to it and a future with many further changes. In fact, at the
time of writing any given Wikipedia table was changed twice in
the past year, on average, but with a standard deviation of 9.1,
and some tables changed multiple times per day. Not only does
the content of tables change; their schema also evolves over time.
This information about evolving schema can serve, for example, to
identify synonymous attributes [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <p>In the following, we highlight selected analyses in this paper
and outline their possible implications for researchers:</p>
      <p>Need for timeliness. Figure 8 shows how quickly a snapshot
becomes outdated. As a consequence, all models that are trained
on static snapshots run the risk of quickly becoming obsolete.
Efficient methods for updating these are therefore desirable.</p>
      <sec id="sec-1-1">
        <title>Help with maintenance and updating</title>
        <p>A large portion of tables is created and maintained by power-users, as can be
seen in Figures 5 and 10. Knowledge about update patterns
could be used to notify these editors about (potentially)
outdated values and the need for updates.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Suggestions for cleaning and improvement</title>
        <p>Figure 16 illustrates an example of the inconsistencies that can emerge in tables.
Based on the knowledge of how similar tables evolve, one
can make concrete suggestions to improve their (in this case)
schemata, such as to rename or add certain columns.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>We discuss two types of related work: structured datasets with
history and Wikipedia change analysis. There is a lot of research
on web tables, so we can only provide a high-level overview in
this short paper and refer to surveys and research papers for more
details.</p>
      <p>
        Related corpora. Wikipedia provides access to its entire version
history, allowing us to track very fine-grained changes. A variety
of datasets that also deal with (semi-) structured content have been
extracted from the web and Wikipedia before. Multiple corpora of
web tables [
        <xref ref-type="bibr" rid="ref12 ref18">12, 18</xref>
        ] provide extracts of static versions of tables on
the web and have since been subject to extensive research [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. For
example, Lautert et al. establish a taxonomy of web tables and thus
give an insight into the general structure of static web tables in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
whereas we focus on the temporal evolution of web tables. The
infobox history dataset WHAD [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] comprises structured
information on Wikipedia, namely the changes of infoboxes. This dataset
is orthogonal to our dataset, since it does not cover general tables,
which are more diverse and also more complex in comparison.
      </p>
      <p>
        Analyzing changes in Wikipedia. The content of Wikipedia
has been the subject of much research [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. While large parts of
that research were conducted on static snapshots of Wikipedia, a
variety of works analyze changes on Wikipedia. Both the evolution
of content [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and the evolution of the page link graph [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] have
been studied. Specifically, the study of content evolution can help
detect conflicts [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] or controversy that may result in edit-wars [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
The edit histories serve as input to event-extraction [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and are
also valuable for trust assignment [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Our approach can provide a
better understanding of what has really changed, from which many
of these studies should benefit.
      </p>
    </sec>
    <sec id="sec-table-corpus">
      <title>TABLE CORPUS</title>
      <p>
To explore tables over time, we need to be able to track tables over
time, which is a non-trivial task as tables and their context can
change over time. We consider tables as objects with an identity,
which, in contrast to their concrete shape and content, stays constant
over time. A table can have multiple versions, where each version is
an edit of the previous version. However, tables on the web usually
lack a stable identifier, which is why we have to infer that identifier.</p>
      <p>
        We proposed a solution for this identity inference through a table
matching procedure. The details of this work are described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
where we also evaluate the matching and show that it works much
better than related work. Our input for the table change extraction
process is a dump of web page versions: either snapshots that
have been crawled or, specifically for Wikipedia, the complete
edit history as a set of XML files. These XML files contain the
actual version content encoded as Wikitext, a markup language
including table markup, as well as additional metadata for each of
the revisions, such as its timestamp, author, and comments.
      </p>
      <p>For every page version, we extract a (possibly empty) list of
parsed table nodes. This list of table nodes for every page revision
constitutes the input for our table matching. For each page revision,
it is necessary to decide whether the tables therein are versions
of previously identified tables or entirely new tables. It is not
sufficient to consider only tables of the directly preceding version,
because tables can be deleted and be restored several revisions
(and sometimes several years) later. To determine the quality of our
matching, we have manually created a gold standard of table
matchings comprising 1,445 tables, with a total of 16,919 distinct table
versions selected from 90 pages. We show that the matching works
well, matching all versions that belong to a given table correctly
for 88.58% of all tables in the gold standard. For a majority of the
remaining tables, we misclassify only a small number of versions
of individual table histories (1 mistake: 6.09%; 2 mistakes: 2.98%).
Our matching decisions for individual table versions reach an
F1-measure of &gt;99%.</p>
      <p>We next present various statistics based on the 3,471,609
Wikipedia table objects we have collected that have been linked using
our matching process. These statistics are based on the Wikidump
of September 1, 2019. Both our gold standard and output dataset
are available at our project website www.IANVS.org. The
different statistics and findings are grouped by the three phases of a
table’s existence: creation, evolution, and deletion. These
three phases of existence already show that establishing a table identity is
essential for the following statistics, because without it, it would
not be possible to determine statistics that aggregate on a per-table
basis (but only on a per-version basis).</p>
    </sec>
    <sec id="sec-3">
      <title>CREATION</title>
      <p>In this section, we focus on the first insert of every table – its
creation – even when during its lifetime a table might be deleted
and recreated multiple times.</p>
    </sec>
    <sec id="sec-4">
      <title>4.1 Where and when are tables created?</title>
      <p>[Figure: y-axis “Number of new tables and pages”.]</p>
      <p>
Tables became popular only around 2004 and were fully adopted by the end of
2006. Since then, every month around 20,000 new tables are
created (about one every two minutes). The hypothesis that insertion
frequency would decrease once tables are inserted at all relevant
locations seems false: While the number of new pages created per
month has been dropping since 2007 (footnote 2), the insertion rate of new tables remains
constant. This relative increase in tables per page shows that more
and more data is stored in a structured fashion, raising the relevance
of methods to extract knowledge from said tables.</p>
      <p>We observed separately that most tables are created at the same
time or soon after the page containing them is created. Larger gaps
between the page creation and the creation of the first table on the
page are common only for pages that were created in the early days
of Wikipedia, when tables were not yet popular.</p>
      <p>Figure 3 shows a histogram of the maximum number of tables
that ever existed simultaneously on a Wikipedia article. The vast
majority of Wikipedia articles contain only a few tables (we omitted
the even larger number of pages that do not contain any tables at
all). On the other hand, most tables appear on pages together with
other tables. Only 19.1% of all tables appear alone on a Wikipedia
article. The many tables that exist in the vicinity of each other can
be assumed to be related in terms of content.</p>
      <p>On Wikipedia, every article can link to categories, which are
used to group related articles to a topic and can themselves be
organized in categories. We investigate how the creation dates of
tables correlate with any year mentioned in these page categories
(such as 2020 for “2020 United States presidential election”), which
we assume to be the relevant years for that table. Figure 4 shows that
the extracted years and the creation year match for most tables. For
every mention of a year in the page categories, a table is counted in
a cell that represents the month of creation (on the x-axis) and the
mentioned year (y-axis). If those two dimensions aligned perfectly,
we would only see marks close to the diagonal of the plot.
Tables tend to be created in the second half
of the mentioned year or in the beginning of the following year,
which shows as a small shift to the right in the plot. For those years
that are covered by our dataset (2004–2019), in 50.8% of all cases the
mentioned year and the year of creation align, for 5.6% of the cases
the tables are created before the mentioned year, and in 43.5% of
the cases, the tables are created after the mentioned year.</p>
      <p>Footnote 2. Source:
https://stats.wikimedia.org/#/en.wikipedia.org/contributing/newpages/normal|line|2001-01-01~2020-10-01|page_type~content|monthly</p>
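      <p>The comparison between creation year and mentioned year can be sketched as follows. This is only a minimal illustration under stated assumptions: category names are plain strings, any standalone four-digit number from 1000 to 2999 counts as a year mention, and the names YEAR_PATTERN, mentioned_years, and alignment_counts are hypothetical. It does not reproduce the exact extraction behind Figure 4 and omits the restriction to the years covered by the dataset.</p>
      <preformat>
```python
import re
from collections import Counter

# Assumption: a year mention is a standalone four-digit number (1000 to 2999).
YEAR_PATTERN = re.compile(r"\b([12]\d{3})\b")


def mentioned_years(categories):
    """Return every year mentioned in a list of category names."""
    return [int(y) for name in categories for y in YEAR_PATTERN.findall(name)]


def alignment_counts(tables):
    """Count year mentions by how the creation year relates to them.

    `tables` is an iterable of (creation_year, categories) pairs.
    Every mention of a year is counted once, as in the description above.
    """
    counts = Counter()
    for creation_year, categories in tables:
        for year in mentioned_years(categories):
            if creation_year == year:
                counts["aligned"] += 1
            elif creation_year > year:
                counts["created_after"] += 1
            else:
                counts["created_before"] += 1
    return counts


example = [
    (2020, ["2020 United States presidential election"]),
    (2019, ["2018 in sports"]),
    (2015, ["2016 Summer Olympics"]),
]
print(alignment_counts(example))
```
      </preformat>
      <p>Dividing each count by the total number of mentions yields shares comparable in kind to the percentages reported above.</p>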
    </sec>
    <sec id="sec-5">
      <title>4.2 Who creates tables?</title>
      <p>The distribution of the number of tables that a user creates is shown
in Figure 5. Only 13.1% of the tables are created by non-registered
users. The figure also clearly shows that tables are more likely to
be created by power-users: More than half of the tables are created
by users who each have also created at least 128 other tables. The
record for the highest number of tables created by a single user is
20,194 (in this case on a variety of sports events).</p>
      <p>A possible explanation for this behavior could be that the effort
and skill it takes to create a new table are too high for many users.
On the other hand, there are very dedicated users who must have
acquired the necessary skills and possibly tools to create thousands
of tables. One insight that we can take from this observation is that
any random sample of tables is likely to be influenced by those
power users.</p>
      <p>[Figure axis: Table age [years].]</p>
      <p>
Creating tables is a tedious job, especially for inexperienced users,
who might not be familiar with the syntax of tables. An obvious
hypothesis is therefore that users copy &amp; paste similar tables (created
by other users or themselves) and adapt them according to their
needs. To investigate this hypothesis, we studied the frequency
with which the same content appears in the first version of
different tables. For a more accurate picture, we also analyzed how many
users chose to use exactly the same table markup code, presumably
as table templates.</p>
      <p>The first observation we made is that 3,004,883 of the 3,471,609
tables (86.6%) in our corpus appear to have unique first versions.
This does not imply that they were not copied from somewhere else;
they may have been copied and then modified prior to the first save of the page. On the other
end of the spectrum, there are templates that are used more than
15,000 times to create tables.</p>
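      <p>One way to count identical first versions is to group tables by a hash of their initial markup. The sketch below is an illustration only: it assumes each first version is available as a (table id, user, markup) triple and compares markup by exact equality, and the function names are hypothetical rather than taken from our pipeline.</p>
      <preformat>
```python
import hashlib
from collections import defaultdict


def fingerprint(markup):
    """Hash the exact markup of a table's first version."""
    return hashlib.sha256(markup.encode("utf-8")).hexdigest()


def template_usage(first_versions):
    """Group tables by identical first-version markup.

    `first_versions` is an iterable of (table_id, user, markup) triples.
    Returns {fingerprint: (number_of_tables, number_of_distinct_users)}.
    """
    tables = defaultdict(set)
    users = defaultdict(set)
    for table_id, user, markup in first_versions:
        key = fingerprint(markup)
        tables[key].add(table_id)
        users[key].add(user)
    return {key: (len(tables[key]), len(users[key])) for key in tables}


def unique_share(usage):
    """Fraction of tables whose first version is shared with no other table."""
    total = sum(n for n, _ in usage.values())
    unique = sum(n for n, _ in usage.values() if n == 1)
    return unique / total if total else 0.0


example = [
    ("t1", "alice", "{| class=wikitable\n! Header\n|-\n| Example\n|}"),
    ("t2", "bob", "{| class=wikitable\n! Header\n|-\n| Example\n|}"),
    ("t3", "alice", "{|\n! Season\n|}"),
    ("t4", "carol", "{|\n! Team\n|}"),
]
usage = template_usage(example)
print(sorted(usage.values()), unique_share(usage))
```
      </preformat>
      <p>Each value pairs the number of tables with the number of distinct users for one template, which is the kind of point plotted per template when comparing table counts against user counts.</p>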
      <p>As can be seen in Figure 6, the ratio between the number of tables and the number of
users that use the same template varies greatly. Some templates
are used for thousands of tables, but also by thousands of different
users (top right in the plot). This is usually the case for example
tables that contain only dummy values (an example can be seen in
Figure 7a). However, there are other templates that are also used for
thousands of tables, but only by a few dozen users (top left in the
plot). These are usually domain-specific templates, such as tables
for sports results or statistics (see Figure 7b for an example).</p>
      <p>[Figure 7 panels: (a) A general table template; (b) A sport-specific table template.]</p>
      <p>[Figure 8 axis: Snapshot date; legend: to be deleted, unchanged, to be updated, to be created.]</p>
    </sec>
    <sec id="sec-6">
      <title>EVOLUTION</title>
      <p>The second phase in the lifetime of a table is its evolution. This
phase encompasses all changes that happen between the initial
creation and the possible final deletion, including changes to data,
to schema, and to shape.</p>
    </sec>
    <sec id="sec-7">
      <title>5.1 How often are tables changed?</title>
      <p>The average table in our corpus is re-inserted 0.62 times, deleted
0.93 times, and updated 13.89 times. Of the 0.62 re-inserts, 0.10 are
fresh table versions, i.e., the table’s content is different from any
previous version, which means 0.52 of the inserts restore previously
existing table versions that were deleted at some prior point in time.
For the 13.89 updates, the ratio of fresh and old versions is 11.97
fresh versus 1.92 updates that restore previously existing versions.
While these average numbers seem quite low, there is a large skew:
there is a table on social networking websites that was updated
more than 10,000 times during its lifetime. At least 1,310 tables were
each updated more than 1,000 times during their lifetimes.</p>
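      <p>The distinction between fresh and restoring versions can be illustrated with a short sketch. It assumes a table history is given as a chronological list of version contents, with None marking revisions in which the table is deleted, and that two versions count as the same if their contents are exactly equal. The function name and the labels are hypothetical.</p>
      <preformat>
```python
def classify_versions(contents):
    """Label each revision of one table history.

    `contents` is the chronological list of the table's contents, with
    None marking a revision in which the table is deleted. Returns a list
    of (operation, freshness) pairs: operation is "insert", "update", or
    "delete"; freshness is "fresh" if the content was never seen before
    in this history, "restore" otherwise, and None for deletions.
    """
    seen = set()
    labels = []
    previous = None  # None means the table is currently absent from the page
    for content in contents:
        if content is None:
            labels.append(("delete", None))
        else:
            operation = "insert" if previous is None else "update"
            freshness = "restore" if content in seen else "fresh"
            labels.append((operation, freshness))
            seen.add(content)
        previous = content
    return labels


history = ["A", "B", None, "B", "A", "C"]
for label in classify_versions(history):
    print(label)
```
      </preformat>
      <p>In this sketch the very first label is the table's creation; every later "insert" corresponds to a re-insert after a deletion.</p>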
      <p>Figure 8 shows the number of tables that would have been
created/updated/deleted by the date of our analyzed snapshot
(September 1, 2019), if the snapshot were taken at a previous point in
time (shown on the x-axis). In a one-month-old snapshot, already
4.4% of tables are outdated. If the snapshot were taken a year earlier,
26.6% of the tables would no longer represent the current state. Over
a five-year range, this number rises to 60.6%.</p>
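      <p>The staleness of a snapshot can be sketched by comparing each table's state at the snapshot date with its state at the reference date. The code below is a simplified illustration: it assumes a history is a chronologically sorted list of (date, content) pairs with None marking a deletion, and it treats a table as updated whenever the two contents differ. The function names are hypothetical.</p>
      <preformat>
```python
from bisect import bisect_right
from datetime import date


def state_at(history, when):
    """Content of a table at `when`, or None if it does not exist then.

    `history` is a chronologically sorted list of (date, content) pairs,
    where content None marks a deletion.
    """
    dates = [d for d, _ in history]
    index = bisect_right(dates, when) - 1
    return history[index][1] if index >= 0 else None


def snapshot_status(history, snapshot, reference):
    """Compare a table's state in an old snapshot with the reference date."""
    old = state_at(history, snapshot)
    new = state_at(history, reference)
    if old is None and new is None:
        return "absent"
    if old is None:
        return "to be created"
    if new is None:
        return "to be deleted"
    return "unchanged" if old == new else "to be updated"


reference = date(2019, 9, 1)
edited = [(date(2015, 1, 1), "v1"), (date(2019, 3, 1), "v2")]
removed = [(date(2015, 1, 1), "v1"), (date(2019, 5, 1), None)]
print(snapshot_status(edited, date(2018, 9, 1), reference))
print(snapshot_status(removed, date(2018, 9, 1), reference))
```
      </preformat>
      <p>Counting the four statuses over all tables for a range of snapshot dates gives one value per legend entry and date.</p>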
      <p>The violin plot in Figure 9 shows how the change frequency
behaves with increasing table age. The shape of a violin plot follows
the distribution of the values: the wider the line, the more probable
the value. Within the violin plot, there are marks at the 0.25/0.5/0.75
quantiles. In particular, it shows how the time since the last update
is distributed for tables at different ages. The median rises until a
certain point, after which it stays constant or slightly decreases
again. However, the distribution is skewed towards the two ends
of the spectrum: tables either are very frequently updated or are
hardly ever changed. For example, considering the quantiles for the
5-year-old tables, more than 25% of these tables were updated in
the last year and for another 25% the last update dates back almost
four years or more.</p>
      <p>[Figure axis: Previous position.]</p>
    </sec>
    <sec id="sec-8">
      <title>5.2 Who changes tables?</title>
      <p>Figure 10 shows how long the original creator is active as an updater
of a table. We distinguish between registered and unregistered
creators, because for unregistered creators we have only the IP address
as an identifier, which might change from time to time. Therefore,
it is not too surprising that the share of edits that is done by an
unregistered creator quickly drops and, hence, we exclude tables
created by unregistered users from this plot. On the other hand,
for registered users, there are tables that are still updated by the
original creators years after they have been created. In reality, the
influence of the original author on a table could be even higher than
what this plot suggests: the number of edits for a table decreases
over time, so the first buckets contain more edits.</p>
      <p>When we look at the number of editors that change individual
tables in Figure 11, we see that a large number of tables (35%) are
updated by the creator of the table only. In this analysis, we only
consider users as editors of a table who create a new version (a
simple revert to a previous version is not counted). While most of
the tables are updated by only a few users, there are some
exceptions where thousands of users contribute to the table. Again, the
previously mentioned table on social networking websites holds
the record with contributions by 4,235 distinct users.</p>
    </sec>
    <sec id="sec-9">
      <title>5.3 Are tables moved?</title>
      <p>Figure 12 shows how much tables move in relation to other tables
on their page. While for most page revisions, the tables do not
move or move only slightly, there are page revisions for which
tables move by up to 1,574 positions for a single page (we removed
this one extreme case as an outlier). We observe that if tables move,
this is often due to the insertion or deletion of tables and that tables
move down on the page (64.09%) more often than up (35.91%). One
obvious reason for this imbalance is the fact that a table inserted
in the middle of the page causes other tables to move down, and
insertions are more common than deletions. On average, a table’s
position changes 1.66 times during its lifespan.</p>
    </sec>
    <sec id="sec-10">
      <title>5.4 How much do tables change?</title>
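      <p>Figure 13 shows how the content of tables develops over time. More
precisely, it shows a similarity score of each table compared to the
first version of that table (calculated on a random subset of 1,000
tables). We use a similarity metric that is based on a word vector
representation of both table versions:
<inline-formula><tex-math>\mathrm{sim}(\vec{a}, \vec{b}) = \frac{\sum_i \min(a_i, b_i)}{\sum_i \max(a_i, b_i)}</tex-math></inline-formula>,
where <inline-formula><tex-math>\vec{a}</tex-math></inline-formula> and <inline-formula><tex-math>\vec{b}</tex-math></inline-formula> are word vectors of the two table versions that should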
be compared. In general, the similarity is expected to decrease over
time, but it can also rise if the table content becomes more similar
to its original version. While there are some tables that stay almost
unchanged throughout their lifetime, there are other tables that
rapidly change within the first few days of their existence. One
reason for this could be that people copy &amp; paste other tables as
templates and then adjust the content, as explained in Section 4.3.</p>
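      <p>Reading the similarity as the sum of element-wise minima divided by the sum of element-wise maxima of the two word vectors, it can be sketched as below. How the word vectors are built is an assumption here: the sketch uses plain whitespace-separated word counts over all cells, and the function names are hypothetical.</p>
      <preformat>
```python
from collections import Counter


def word_vector(cells):
    """Bag-of-words vector: word -> count over all cell strings."""
    return Counter(word for cell in cells for word in cell.split())


def similarity(a, b):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    words = set(a) | set(b)
    numerator = sum(min(a[w], b[w]) for w in words)
    denominator = sum(max(a[w], b[w]) for w in words)
    # Two empty vectors are treated as identical (an assumption of this sketch).
    return numerator / denominator if denominator else 1.0


first = word_vector(["Team A", "Team B", "3"])
later = word_vector(["Team A", "Team C", "5"])
print(similarity(first, later))
```
      </preformat>
      <p>The score is 1 for identical vectors and 0 for vectors that share no words, so it can both fall and rise again as a table drifts away from and back toward its first version.</p>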
      <p>During their lifetime, 23.6% of all tables either grow or shrink in
the number of columns, 37.0% grow or shrink in the number of rows.
However, 57.3% of all tables retain their original size throughout
their lifetime.</p>
    </sec>
    <sec id="sec-11">
      <title>5.5 How much do schemata change?</title>
      <p>About half of all tables never change their schema, as can be seen
in Figure 14 (note that this is a log-log plot). On the other side,
there are tables that change their schema hundreds of times, up to
443 changes. On average, each table has 1.86 schema versions. Schema
changes take many forms: for example, columns are
renamed, added, or removed.</p>
      <p>Figure 16 shows a vivid example of how schemata of web tables
evolve over time. To create this plot, we created a clustering of
schemata based on tables that evolve from one schema to another.
This particular plot shows a cluster of schemata that all contain
information about league results of football teams. There are almost
500 tables for which at least one of the snapshots had one of the
Schemata 2–7. More than half of those tables followed Schema 6
at the beginning of 2011, while the other half mostly did not yet
exist (Group 1). The splines show how this schema evolved to
many different specializations until 2018. While in some cases these
specializations make sense (such as a clarification about the league
system), in other cases they are due to inconsistent changes (such as
the header “Year”, which after manual inspection should actually be
“Season”, a range spanning two consecutive years, in most cases).
As these tables are web tables, the header can also be formatted
differently and we can see that for most tables of Schema 7, the
header was changed to bold type (Schema 6) between 2009 and
2010. Still, a small number of tables had not made this transition
even after almost a decade.</p>
    </sec>
    <sec id="sec-12">
      <title>DELETION</title>
      <p>Figure 15 shows how long tables survive, counting the days from
their creation. The blue part shows the percentage of tables that
reached the respective age without being deleted. The green part
represents those tables that have been created long enough ago
such that they could have reached the respective age, but were
deleted before reaching that age. 69.5% of all tables ever created
have survived until the end-date of our dataset. If a table is deleted,
then this usually happens at the beginning of its lifetime. The longer
a table exists, the less likely it becomes that it will be deleted.</p>
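      <p>The survival share for a given age can be sketched as follows. This illustration assumes each table is reduced to a (created, deleted) pair of dates, where deleted is None for tables alive at the end date, and it counts only tables created long enough ago to have reached the age in question. Whether temporary deletions are considered is left open here; the function name is hypothetical.</p>
      <preformat>
```python
from datetime import date


def survival_share(tables, age_days, end_date):
    """Share of eligible tables that reached `age_days` without deletion.

    `tables` is a list of (created, deleted) date pairs; `deleted` is None
    for tables still alive at `end_date`. A table is eligible only if it
    was created at least `age_days` before `end_date`.
    Returns None when no table is eligible.
    """
    eligible = 0
    alive = 0
    for created, deleted in tables:
        if (end_date - created).days >= age_days:
            eligible += 1
            if deleted is None or (deleted - created).days >= age_days:
                alive += 1
    return alive / eligible if eligible else None


end = date(2019, 9, 1)
tables = [
    (date(2010, 1, 1), None),               # old table, still alive
    (date(2018, 1, 1), date(2018, 1, 11)),  # deleted after 10 days
    (date(2019, 8, 25), None),              # only 7 days old at the end date
]
print(survival_share(tables, 5, end), survival_share(tables, 30, end))
```
      </preformat>
      <p>Evaluating this function for every age in days gives the alive share per age; its complement is the share of eligible tables deleted before reaching that age.</p>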
      <p>In our table corpus, the average time from a table’s creation until its
deletion (or until the end date of the dataset) is 4.93 years.
For 97.7% of that time, the table is truly part of the page, while for the
remaining 40.50 days the table is (temporarily) deleted.</p>
      <p>While the vast majority of tables is never deleted (57.2%) or
deleted only once (29.9%), the distribution of deletes is heavily
skewed. One table that explains the Wiki syntax was deleted 620
times during its lifetime, mostly due to vandalism.</p>
    </sec>
    <sec id="sec-13">
      <title>CONCLUSIONS</title>
      <p>In summary, we have seen how fast tables on Wikipedia change and
how fast they come and go. When working with this corpus, it is
important to keep this additional temporal dimension in mind and
leverage it when possible. The history also makes other dimensions
of the corpus accessible, such as the creators, editors, or templates,
which together provide a perspective on the tables that is more
holistic than single snapshots of individual tables or a table corpus.</p>
      <p>As future work, we plan to explore whether other structured
corpora, such as Wikipedia infoboxes or lists, for which we also provide
histories, behave similarly in terms of their dynamics. Furthermore,
we want to use the gained insights to assign trust to values and
improve data quality. We encourage researchers to explore their
datasets in a similar manner to uncover hidden information in a
dataset’s history.</p>
      <p>[Figure legend: deleted, alive.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. Thomas</given-names>
            <surname>Adler</surname>
          </string-name>
          , Krishnendu Chatterjee, Luca De Alfaro, Marco Faella, Ian Pye, and
          <string-name>
            <given-names>Vishwanath</given-names>
            <surname>Raman</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Assigning trust to Wikipedia content</article-title>
          .
          <source>In Proceedings of the International Symposium on Wikis (WikiSym)</source>
          .
          <fpage>26:1</fpage>
          -
          <lpage>26:12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Enrique</given-names>
            <surname>Alfonseca</surname>
          </string-name>
          , Guillermo Garrido, Jean-Yves Delort, and
          <string-name>
            <given-names>Anselmo</given-names>
            <surname>Peñas</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>WHAD: Wikipedia historical attributes data - Historical structured data extraction and vandalism detection from the Wikipedia edit history</article-title>
          .
          <source>Language Resources and Evaluation</source>
          <volume>47</volume>
          ,
          <issue>4</issue>
          (
          <year>2013</year>
          ),
          <fpage>1163</fpage>
          -
          <lpage>1190</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Tobias</given-names>
            <surname>Bleifuß</surname>
          </string-name>
          , Leon Bornemann, Theodore Johnson,
          <string-name>
            <given-names>Dmitri V.</given-names>
            <surname>Kalashnikov</surname>
          </string-name>
          , Felix Naumann, and
          <string-name>
            <given-names>Divesh</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Exploring Change - A New Dimension of Data Analytics</article-title>
          .
          <source>PVLDB</source>
          <volume>12</volume>
          ,
          <issue>2</issue>
          (
          <year>2018</year>
          ),
          <fpage>85</fpage>
          -
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Tobias</given-names>
            <surname>Bleifuß</surname>
          </string-name>
          , Leon Bornemann, Dmitri V. Kalashnikov,
          <string-name>
            <given-names>Felix</given-names>
            <surname>Naumann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Divesh</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Structured Object Matching across Web Page Revisions</article-title>
          .
          <source>In Proceedings of the International Conference on Data Engineering (ICDE)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Leon</given-names>
            <surname>Bornemann</surname>
          </string-name>
          , Tobias Bleifuß, Dmitri V. Kalashnikov,
          <string-name>
            <given-names>Felix</given-names>
            <surname>Naumann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Divesh</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Natural Key Discovery in Wikipedia Tables</article-title>
          .
          <source>In Proceedings of The Web Conference 2020</source>
          .
          <fpage>2789</fpage>
          -
          <lpage>2795</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Andrei Z.</given-names>
            <surname>Broder</surname>
          </string-name>
          , Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Tomkins</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Janet L.</given-names>
            <surname>Wiener</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Graph structure in the Web</article-title>
          .
          <source>Comput. Networks</source>
          <volume>33</volume>
          ,
          <issue>1-6</issue>
          (
          <year>2000</year>
          ),
          <fpage>309</fpage>
          -
          <lpage>320</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.D.</given-names>
            <surname>Broido</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Clauset</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Scale-free networks are rare</article-title>
          .
          <source>Nature Communications</source>
          <volume>10</volume>
          ,
          <elocation-id>1017</elocation-id>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Luciana S.</given-names>
            <surname>Buriol</surname>
          </string-name>
          , Carlos Castillo, Debora Donato, Stefano Leonardi, and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Millozzi</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Temporal analysis of the wikigraph</article-title>
          .
          <source>In International Conference on Web Intelligence (WI)</source>
          .
          <fpage>45</fpage>
          -
          <lpage>51</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Siarhei</given-names>
            <surname>Bykau</surname>
          </string-name>
          , Flip Korn, Divesh Srivastava, and
          <string-name>
            <given-names>Yannis</given-names>
            <surname>Velegrakis</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Fine-grained controversy detection in Wikipedia</article-title>
          .
          <source>In Proceedings of the International Conference on Data Engineering (ICDE)</source>
          .
          <fpage>1573</fpage>
          -
          <lpage>1584</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Cafarella</surname>
          </string-name>
          , Alon Halevy,
          <string-name>
            <given-names>Hongrae</given-names>
            <surname>Lee</surname>
          </string-name>
          , Jayant Madhavan, Cong Yu, Daisy Zhe Wang, and Eugene Wu
          .
          <year>2018</year>
          .
          <article-title>Ten years of webtables</article-title>
          .
          <source>PVLDB</source>
          <volume>11</volume>
          ,
          <issue>12</issue>
          (
          <year>2018</year>
          ),
          <fpage>2140</fpage>
          -
          <lpage>2149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Ceroni</surname>
          </string-name>
          , Mihai Georgescu, Ujwal Gadiraju, Kaweh Djafari Naini, and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Fisichella</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Information evolution in Wikipedia</article-title>
          .
          <source>In Proceedings of the International Symposium on Open Collaboration (OpenSym)</source>
          .
          <fpage>24:1</fpage>
          -
          <lpage>24:10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Julian</given-names>
            <surname>Eberius</surname>
          </string-name>
          , Maik Thiele, Katrin Braunschweig, and
          <string-name>
            <given-names>Wolfgang</given-names>
            <surname>Lehner</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Top-k Entity Augmentation Using Consistent Set Covering</article-title>
          .
          <source>In Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM)</source>
          .
          <fpage>8:1</fpage>
          -
          <lpage>8:12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Mihai</given-names>
            <surname>Georgescu</surname>
          </string-name>
          , Nattiya Kanhabua, Daniel Krause, Wolfgang Nejdl, and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Siersdorfer</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Extracting event-related information from article updates in Wikipedia</article-title>
          .
          <source>In Advances in Information Retrieval (ECIR)</source>
          . Springer,
          <fpage>254</fpage>
          -
          <lpage>266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Braden</given-names>
            <surname>Hancock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hongrae</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Cong</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Generating Titles for Web Tables</article-title>
          .
          <source>In Proceedings of the International World Wide Web Conference (WWW)</source>
          . ACM,
          <fpage>638</fpage>
          -
          <lpage>647</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Aniket</given-names>
            <surname>Kittur</surname>
          </string-name>
          , Bongwon Suh,
          <string-name>
            <given-names>Bryan A.</given-names>
            <surname>Pendleton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ed H.</given-names>
            <surname>Chi</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>He says, she says: conflict and coordination in Wikipedia</article-title>
          .
          <source>In Proceedings of the International Conference on Human Factors in Computing Systems (SIGCHI)</source>
          .
          <fpage>453</fpage>
          -
          <lpage>462</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Flip</given-names>
            <surname>Korn</surname>
          </string-name>
          , Xuezhi Wang,
          <string-name>
            <given-names>You</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Cong</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Automatically Generating Interesting Facts from Wikipedia Tables</article-title>
          .
          <source>In Proceedings of the International Conference on Management of Data (SIGMOD)</source>
          .
          <fpage>349</fpage>
          -
          <lpage>361</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Larissa R.</given-names>
            <surname>Lautert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marcelo M.</given-names>
            <surname>Scheidt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Carina F.</given-names>
            <surname>Dorneles</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Web Table Taxonomy and Formalization</article-title>
          .
          <source>SIGMOD Record</source>
          <volume>42</volume>
          ,
          <issue>3</issue>
          (
          <year>2013</year>
          ),
          <fpage>28</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Oliver</given-names>
            <surname>Lehmberg</surname>
          </string-name>
          , Dominique Ritze, Robert Meusel, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>A large public corpus of web tables containing time and context metadata</article-title>
          .
          <source>In Proceedings of the International Conference Companion on World Wide Web</source>
          .
          <fpage>75</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Pei</given-names>
            <surname>Li</surname>
          </string-name>
          , Xin Luna Dong, Andrea Maurino, and
          <string-name>
            <given-names>Divesh</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Linking temporal records</article-title>
          .
          <source>PVLDB</source>
          <volume>4</volume>
          ,
          <issue>11</issue>
          (
          <year>2011</year>
          ),
          <fpage>956</fpage>
          -
          <lpage>967</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Mostafa</given-names>
            <surname>Mesgari</surname>
          </string-name>
          , Chitu Okoli, Mohamad Mehdi, Finn Årup Nielsen, and
          <string-name>
            <given-names>Arto</given-names>
            <surname>Lanamäki</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>The sum of all human knowledge: A systematic review of scholarly research on the content of Wikipedia</article-title>
          .
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>66</volume>
          ,
          <issue>2</issue>
          (
          <year>2015</year>
          ),
          <fpage>219</fpage>
          -
          <lpage>245</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Rakesh</given-names>
            <surname>Pimplikar</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sunita</given-names>
            <surname>Sarawagi</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Answering Table Queries on the Web Using Column Keywords</article-title>
          .
          <source>PVLDB</source>
          <volume>5</volume>
          ,
          <issue>10</issue>
          (
          <year>2012</year>
          ),
          <fpage>908</fpage>
          -
          <lpage>919</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Shazia Wasim</given-names>
            <surname>Sadiq</surname>
          </string-name>
          , Tamraparni Dasu, Xin Luna Dong, Juliana Freire,
          <string-name>
            <given-names>Ihab F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          , Sebastian Link,
          <string-name>
            <given-names>Renée J.</given-names>
            <surname>Miller</surname>
          </string-name>
          , Felix Naumann,
          <string-name>
            <given-names>Xiaofang</given-names>
            <surname>Zhou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Divesh</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Data Quality: The Role of Empiricism</article-title>
          .
          <source>SIGMOD Record</source>
          <volume>46</volume>
          ,
          <issue>4</issue>
          (
          <year>2017</year>
          ),
          <fpage>35</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Sottovia</surname>
          </string-name>
          , Matteo Paganelli, Francesco Guerra, and
          <string-name>
            <given-names>Yannis</given-names>
            <surname>Velegrakis</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Finding Synonymous Attributes in Evolving Wikipedia Infoboxes</article-title>
          .
          <source>In Advances in Databases and Information Systems (ADBIS)</source>
          .
          <fpage>169</fpage>
          -
          <lpage>185</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Jingjing</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Haixun</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhongyuan</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kenny Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Understanding tables on the web</article-title>
          .
          <source>In Proceedings of the International Conference on Conceptual Modeling (ER)</source>
          . Springer,
          <fpage>141</fpage>
          -
          <lpage>155</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Ziqi</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Effective and Efficient Semantic Table Interpretation using TableMiner+</article-title>
          .
          <source>Semantic Web</source>
          <volume>8</volume>
          ,
          <issue>6</issue>
          (
          <year>2017</year>
          ),
          <fpage>921</fpage>
          -
          <lpage>957</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>