Towards a Large Corpus of Richly Annotated
      Web Tables for Knowledge Base Population

    Basil Ell? , Sherzod Hakimov? , Philipp Braukmann, Lorenzo Cazzoli, Fabian
    Kaupmann, Amerigo Mancino, Junaid Altaf Memon, Kai Rother, Abhishek
                             Saini, and Philipp Cimiano

                             CIT-EC, Universität Bielefeld
                        {bell, shakimov} @cit-ec.uni-bielefeld.de


        Abstract. Web Table Understanding in the context of Knowledge Base
        Population and the Semantic Web is the task of i) linking the content of
        tables retrieved from the Web to an RDF knowledge base, ii) of building
        hypotheses about the tables’ structures and contents, iii) of extracting
        novel information from these tables, and iv) of adding this new informa-
        tion to a knowledge base. Knowledge Base Population has gained more
        and more interest in the last years due to the increased demand in large
        knowledge graphs which became relevant for Artificial Intelligence appli-
        cations such as Question Answering and Semantic Search.
        In this paper we describe a set of basic tasks which are relevant for Web
        Table Understanding in the mentioned context. These tasks incremen-
        tally enrich a table with hypotheses about the table’s content. In doing
        so, in the case of multiple interpretations, selecting one interpretation
        and thus deciding against other interpretations is avoided as much as
        possible. By postponing these decision, we enable table understanding
        approaches to decide by themselves, thus increasing the usability of the
        annotated table data.
        We present statistics from analyzing and annotating 1.000.000 tables
        from the Web Table Corpus 2015 and make this dataset as well as our
        code available online.1

        Keywords: Information Extraction, Table Interpretation, Corpus Cre-
        ation, Corpus Annotation, Hypothesis Creation


1     Introduction
Large amounts of information are available on the Web. However, most infor-
mation is not processable by machines in a way which would allow machines
to perform semantic search on this content or to answer questions using this
data. Having data represented in the RDF (Resource Description Format) for-
mat would be one possibility towards this goal. Despite the progress made in
?
    Corresponding author
1
    The data is available at http://doi.org/10.4119/unibi/2912802, the code is avail-
    able at https://github.com/isywtu/code, and the website is located at https:
    //isywtu.github.io/website.
the field of Natural Language Understanding, extracting information from tex-
tual documents and representing the content in RDF remains limited due to the
complexity of natural language.
    Besides natural language texts, the Web also contains a plethora of tables
and it might be easier to extract information from tables due to their inherent
structure (e.g., rows in a table may be similar to each other) than from text.
    An example of a table is shown in Table 1. A human possessing general
knowledge could assume that this table is about American politicians, their
dates of birth, and the parties they belong to. That means, humans can use
their knowledge to build hypotheses about the data. Having hypotheses about
some parts of the table enables them to obtain new information from other parts.
    Given that more and more RDF data became available online, e.g., in the
form of knowledge bases such as DBpedia and as Linked Open Data in general,
we have machine-processable information available so that machines can build
hypotheses about the content of tables, develop an understanding of the schema
underlying a table, and add extracted data to knowledge bases. RDF data is
thus applied as a leverage for Information Extraction from Web tables.
    In this paper we present basic tasks that enrich a large subset of tables of an
existing dataset – the WDC Web Table Corpus 2015 (WTC) – with hypotheses
based on DBpedia. Tables enriched with hypotheses created by basic tasks – such
as table normalization or entity linking – allow others to focus on higher-level
tasks of table understanding, such as column understanding, and to investigate
how much information extracted from Web tables could be added to DBpedia.
    Our main contributions are: i) we present eight basic tasks that create hy-
potheses about (parts of) tables. ii) We enrich a corpus of 1,000,000 tables with
hypotheses, thus allowing other researchers to focus on higher-level tasks. An
important aspect is that in the case of multiple possibly disagreeing interpre-
tations, selecting one interpretation and thus deciding against other interpreta-
tions is avoided. By postponing these decisions we enable table understanding
approaches to decide by themselves, thus increasing the usability of the anno-
tated table data. iii) We present statistics about 1 million tables and the results
of applying our tasks on these tables, and iv) we make all annotated data, the
code as well as further statistics and example tables available.
    The remainder of this paper are structured as follows: Section 2 presents
the basic tasks, how they build on the WTC data, and how hypotheses of one
tasks are built on top of hypotheses created by other tasks. Section 3 presents
statistics about the analyzed data, Section 4 discusses related work, and Section 5
concludes the paper.


2   Basic Table Interpretation Tasks

In this section we describe basic table interpretation tasks. We refer to them as
basic, since they create hypotheses which are prerequisites to understanding the
entire table. For example, a basic hypothesis concerns whether a cell is a header
cell or a data cell (see table segmentation, Section 2.4).

                                        2
Table 1. Example of a horizontal table (adapted example from the paper Understand-
ing Tables on the Web by Wang et al. [18]).

           Politician      Surname Date of Birth State   Political Party
           Barack Obama Obama Aug 4, 1961 Illinois -
           George W. Bush Bush     July 6, 1946 Texas Republican
           Hillary Clinton Clinton Oct 26, 1947 -     Democratic


    We created our framework for Web table understanding for the WDC Web
Table Corpus 2015 2 which consists of 233 million tables in JSON format. This
corpus was created from a set of 1.78 billion HTML pages from the Common
Crawl July 2015 Web corpus.3 At the time that corpus was created, several deci-
sions were made by the authors of that corpus: tables that contain tables in their
cells as well as small tables (those consisting of less than two columns or three
rows) were excluded. The set of exclusion criteria contains further constraints
which we do not discuss here. Besides the exclusion of tables, four other aspects
were decided and are annotated in the corpus. We take these annotations as truth
and do not reimplement these tasks but we transform these annotations into hy-
potheses, so that our tasks can build hypotheses on top of these hypotheses. We
refer to these existing tasks as WTC (Web Table Corpora) tasks.
    Hypotheses are represented in JSON format and are organized according to
what part of a table a hypothesis is about: table, row, column, and cell. For
example, the orientation (horizontal or vertical) of a table is a hypothesis about
the table, whereas an entity mentioned in a cell is a hypothesis about a cell.
    Hypothesis generation is an incremental process: tasks can be executed re-
peatedly, thereby adding hypotheses to a table, and hypotheses can be related to
hypotheses added in previous steps. For example, language detection (see Section
2.6) might assume a text to be in German. Based on this hypothesis, given a
cell value 10.11.12, table normalization (Section 2.7) then assumes the string to
be a date in day-month-year format (day=10, month=11, year=2012), whereas
it would assume the string to be a date in month-day-year format (day=11,
month=10, year=2012) in case where the detected language is American En-
glish. Note however, that we allow contradictory hypotheses. In the example
above it could be the case that language detection identifies both American En-
glish and German and thus creates two language hypotheses. Then, for each
language hypothesis the table normalization task creates another hypothesis. As
another example of the incremental process, a table might initially be not ex-
cluded from the corpus. However, after executing entity linking (Section 2.8),
the table might be excluded if too few entities are identified.
    The hierarchy of the tasks is shown in Figure 1. Note that some tasks process
data from the WTC corpus directly (e.g., language detection), other tasks only
process hypotheses created by other tasks (e.g., entity linking processes hypothe-

2
    This corpus is available at http://webdatacommons.org/webtables/
3
    http://commoncrawl.org/2015/08/july-2015-crawl-archive-available/


                                        3
ses created by table normalization). So far, not all hypotheses are processed by
some task, but may be processed by some non-basic tasks is the future.


                                                                  Literal Linking      Table Exclusion

                                                                            Entity Linking

                                                                 Table Normalization

Table Orient. Detection (WTC)   Table Classification (WTC)            Language Detection             Table Segm. (WTC)

                                                                   WTC data


Fig. 1. Hierarchy of basic tasks and the data they process. For example, literal linking
processes hypotheses created by entity linking and table normalization, whereas table
orientation detection processes data from the WTC corpus directly.


2.1     Scheduling Task

Scheduling is the task of deciding which tasks to execute on a table and in which
order to execute the tasks. The scheduling task reads in a table, sends it to one of
the other tasks and retrieves the annotated table. Our scheduler sends the table
data to the tasks in the following order: language detection, table normalization,
entity linking, and literal linking. Note that this is a simple approach and one
could also introduce more complex cyclic orders. After each of these steps, the
table is sent to the table exclusion task. If a table should be excluded, no further
tasks are called upon. Otherwise, the scheduler proceeds in the above order.
After all tasks have been applied to a table or the table has been excluded, the
scheduler stores the table.


2.2     Table Orientation Detection Task (WTC task)

This WTC task identifies the orientation of a table, where the orientation can
either be horizontal (i.e., for some columns it is the case that they stand for
an attribute) or vertical. We translate the WTC annotation into a table-related
hypothesis of type table orientation (thus we do not reimplement their ap-
proach), for example as shown in Figure 2 for the horizontal table in Table 1.

Fig. 2. H0 – A table orientation hypothesis related to the example table (Table 1).
"H0" : {
 "created_by_task": "table orientation detection",
 "hypothesis_type": "table orientation",
    "orientation" : "horizontal",
    "source" : "WTC",
}


                                                             4
2.3   Table Classification Task (WTC task)

Table Classification is the task of classifying a table into one of the four classes
relational table, entity table, matrix table, and layout table, as done by the creators
of the WTC corpus and as described in [4] – thus we do not reimplement their
approach. Examples of these table types can be found on the WTC website. The
example table (Table 1) can be classified as relational table. We translate the
WTC annotation into a table-related hypothesis of type table classification,
for example as shown in Figure 3.

Fig. 3. H1 – A table classification hypothesis related to the example table (Table 1).
"H1" : {
  "created_by_task": "table classification",
  "hypothesis_type": "table classification",
  "classification" : "relational",
  "source" : "WTC",
}

2.4   Table Segmentation Task (WTC task)

Table Segmentation is the task of segmenting a table into header areas and data
areas. Header row detection is already done for the WTC tables. In principle, the
task could go beyond header row detection, since, for example, tables can have
a more complex structure where the first column contains headers, too. We do
not implement the segmentation task but rely on the WTC data. We create the
following row-based or column-based hypotheses as shown in Figure 4. Whether
this hypothesis is column-based or row-based depends on whether it is added to
the list of column-based or row-based hypotheses.

Fig. 4. H2 – A table segmentation hypothesis related to the example table (Table 1).
"H2": {
   "created_by_task": "table_segmentation",
   "hypothesis_name": "table_segmentation",
   "header_row": [0],
   "source": "WTC"
}

2.5   Table Exclusion Task

The purpose of this task is to exclude tables from further processing if it seems
like a table does not contain information that could be added to the knowledge
base or if it is unlikely that valid information can be extracted from the table.
For example, if the table seems to be used for layout purposes in a webpage only
(e.g., for the menu of the page), or if no cell can be linked to an entity in our
knowledge base, such as if a table only consists of numerical values, then the
information probably does not fall into the domain of the knowledge base, or if
the table structure is too complex (e.g., with multiple header rows and columns),
then it is unlikely that we can arrive at a correct understanding.


                                          5
    For this task we rely on the table type classification of WTC. All tables
that are not classified as relational table (see Section 2.3) are excluded. After
executing the entity linking task (see Section 2.8) on a table, the table is excluded
if no entity linking hypotheses were generated.


2.6    Language Detection Task

Language Detection is the task of detecting one or more languages for a given
table. Therefore, raw cell data is analyzed and table-based language hypothe-
ses with confidence values are created. Language information helps to reduce
the complexity of finding the right information for the tasks of table normaliza-
tion, entity linking, and literal linking. It facilitates finding correct formats for
datatypes and unit measures. It also reduces computational costs for searching
indices (i.e., those used by entity linking and literal linking).
    We concatenate the content of all cells into a single string that is used as
input to the language classification tool langdetect,4 which computes the most
probable languages with probabilities for 55 languages. An example hypothesis
is shown in Figure 5. The language tags are ISO_639-1 tags.5

Fig. 5. H3 – A language hypothesis related to the example table (Table 1).
"H3" : {
   "created_by_task" : "language_detection", "lang" : "en",
   "confidence" : 0.9
}


2.7    Table Normalization Task

Table normalization is the task of normalizing values found in cells, such as
transforming different representations of dates into a canonical representation
so that other tasks do not need to take into account various representations.
    Values, such as those representing weights, lengths, volumes, time etc. can
have unit identifiers attached. For each string that appears to be a value followed
by a unit identifier we create a hypothesis that contains both the value and the
base unit identifier separately where the value is converted to the base unit (i.e.,
kg for weights). For example, given values of 10kg, 100g, 34t, these are interpreted
as weights and are converted to kilograms. A particular emphasis is given to the
date representation because dates often occur in tables and multiple formats
are possible (e.g., “4 August 1961”, “4-8-1961”, “Aug 1, 2016”, “August 1, 2016”,
“1961/8/4” or “1961.8.4”). For each date that we detect, we create a hypothesis
that contains the original value we found and also the xsd:date value. Note that
hypotheses created by the language detection task are taken into account. This
allows to test with regular expressions that are specific to a certain language,
such as regular expressions for German date formats.
4
    https://pypi.python.org/pypi/langdetect
5
    https://www.iso.org/iso-639-language-codes.html


                                          6
    For each cell we always create a plain hypothesis that contains the original
value and specifies the datatype xsd:string. The advantage of creating hypotheses
about a cell’s content is that other tasks do not have to look at the original data
anymore but only need to scan for hypotheses. This can be seen in Figure 1
where the entity linking task only needs to process hypotheses created by the
table normalization task. Another relevant aspect of the plain hypothesis: since
our interpretation can go wrong, for example, a cell may contain the value 100
and it can be interpreted as a number or as a string, if we only create the
hypothesis that this cell contains an integer value, then the entity linking task
would ignore this cell, even though the table is about movies and 100 is actually
a movie title. For each language we create another plain hypothesis.
    Figure 6 shows an example of a hypothesis created for the string Aug, 4,
1961 which was found in Table 1 in cell 2−1 (column-row). The types of hy-
potheses that we create are date as shown in Figure 6, value for integer or float
values without unit identifiers, value and unit for values with unit identifier,
and plain.

Fig. 6. A table normalization hypothesis related to the cell 2-1 (column-row) (with
the content Aug, 4, 1961) in the example table (Table 1).
"H4": {
   "created_by_task": "table_normalization",
   "hypothesis_name": "date",
   "based_on_hypotheses": ["H3"]
   "data_type": "xsd:date",
   "original_value": "Aug, 4, 1961",
   "value": "1961-08-04",
   "year": 1961,
   "month": 8,
   "day": 4,
   "refers_to_row": 1,
   "refers_to_column": 2,
}


2.8   Entity Linking Task

This task links strings found in cells to DBpedia. For each type of resource (en-
tity, property, class) and for each language we create another index of strings
which are the names of the respective resources according to DBpedia. For prop-
erties and classes these are given via the property rdfs:label. For entities we
use the same methodology as NERFGUN as described in [6] to create the index.
    Given the value of plain hypotheses created by table normalization, we check
the three indexes related to the language the hypothesis is based on for all
resources this value could refer to and order the results by their frequency. For
the top 10 entities, properties, and classes we create cell-based hypotheses.
    For example, given the table normalization hypothesis shown in Figure 7
created for the string Obama found in the example table (Table 1), we create
the hypothesis shown in Figure 8. The hypotheses contain the type, the URI,


                                        7
and the confidence value (which is the frequency value normalized by the sum
of frequency values of all candidates) of the resource.
Fig. 7. H5 – A table normalization hypothesis related to the cell 1-1 (column-row) in
the example table (Table 1).
"H5": {
   "created_by_task": "table_normalization",
   "hypothesis_name": "plain",
   "based_on_hypotheses": ["H3"]
   "value": "Obama",
   "refers_to_column": 1,
   "refers_to_row": 1,
}
Fig. 8. H6 – An entity linking hypothesis related to the example table (Table 1).
"H6": {
   "created_by_task": "entity_linking",
   "hypothesis_type": "resource",
   "based_on_hypotheses": ["H5"],
   "entity": "dbr:Barack_Obama",
    "confidence": 0.67
}

2.9   Literal Linking Task
This tasks links strings found in cells to entities identified in other cells of the
same row, thereby also identifying the property that links the entity with the
literal value. That means, instead of linking strings to entities as done by entity
linking, strings are linked to literal values in DBpedia. Therefore, it processes
hypotheses created by table normalization (all types of hypotheses created by
that task) and entity linking. For example, given Table 1, given the table normal-
ization hypothesis shown in Figure 6 (which expresses that the string 4 August
1961 represents the literal "1961-08-04"ˆˆxsd:date), and given the entity link-
ing hypothesis shown in Figure 8 (which expresses that the string Barack Obama
can be linked to the entity dbr:Barack_Obama), the hypothesis shown in Figure
9 is created. It expresses that the date is the birth date of Barack Obama.
    This task makes use of an index for each language (containing language-
tagged strings as well as datatyped-literals from the respective language version
of DBpedia) to quickly retrieve a set of properties given an entity, a literal, and
a datatype. When building this index, all properties that are used when building
the entity linking indexes are ignored to avoid the creation of hypotheses similar
to those created by entity linking.
Fig. 9. H7 – A literal linking hypothesis related to the example table (Table 1).
"H7": {
   "created_by_task": "literal_linking",
   "based_on_hypotheses": ["H3","H6"],
   "modified_literal": "1961-08-04",
   "property": "dbo:birthDate"
}


                                          8
3   Statistics from Analyzing 1.000.000 Web Tables
From the 99 tar archives of the WTC dataset we select the first 62, 500 tables
from each of the first 16 archives, thus resulting in a set of 1, 000, 000 tables.
     In a complete run over the corpus with the language detection task only, we
found one of the following five languages (English, German, Catalan, French,
Spanish) in most tables. Currently we exclude tables in other languages after
the language detection step. These languages appear in the language statistics
but not in statistics of later tasks such as table normalization and entity linking.
     Note that in this section we do not evaluate the correctness of the hypotheses
created. Rather, we provide data that might tell us something about the nature
of the data and our approaches. For example, it is interesting to know how
many tables exist where no hypotheses were added or where for a cell multiple
literal linking hypotheses were created that relate the literal to multiple entities
detected in other cells. For example, a value could be the birth date of one entity
as well as the foundation date of another entity. Interesting tables can then be
analyzed manually to check whether the tasks need to be improved or to devise
more advanced tasks, such as triplification.
     Scheduling Task: We measured computation time and the number of hypothe-
ses created by each task. The complete processing of a table took an average of
0.9s (±116.1s). Average values for other tasks are: 0.00005s (±0.001s) for ta-
ble exclusion, 0.014s (±0.015s) for language detection, 0.003s (±0.2s) for table
normalization, 0.005s (±0.36s) for entity linking, and 3.2s (±213.5s) for literal
linking. Tasks that transform WTC data into hypotheses such as table clas-
sification, orientation detection, and table segmentation are performed by the
scheduler and were not measured individually.
     The processing of 1, 000, 000 tables took 266h. Table exclusion took 1min,
language detection took 1.3h, table normalization took 14min, entity linking
took 31min and literal linking took 264h. Per table, language detection created
an average of 1.1 hypotheses (±0.4), table normalization created an average of
220 hypotheses (±1360), entity linking created an average of 270.6 hypotheses
(±2117), and literal linking created an average of 0.01 hypotheses (±0.4). For
the tasks orientation detection, table classification, and table segmentation, for
each table one hypothesis was created. Relative to the number of cells in a table,
table normalization created 2.3 hypotheses (±1.02), entity linking created 2.7
hypotheses (±2.77), and literal linking created 0.00007 hypotheses (±0.002).
     Table Segmentation: In our sample of 1, 000, 000 tables, we detected headers
for 332, 676 tables. Note that header detection happened after the exclusion
based on table type. Therefore, the 332, 676 tables with a header account for
95% of the tables that were not excluded.
     Table Exclusion: 650,716 out of 1,000,000 tables were excluded because of the
table type. There were no occurrences of exclusion after language detection or
entity linking. The remaining 349,284 tables went through all processing steps.
     Language Detection: From all the tables that were not excluded, English
was detected for 321,066 tables (91.9%). The next most frequently occurring
languages were German (20,464 / 5.8%), Catalan (7,248 / 2.1%), French (5,568

                                         9
/ 1.6%), and Spanish (4697 / 1.3%). While for most tables we detected only
one language, for 11.2% of tables we detected multiple languages. The most
frequent combination is English and German, accounting for 32.4% of tables
with multiple languages. Other frequent combinations are English and Catalan
(7.5%) and English and French (3.1%).
    Table Normalization: The task considered 316, 006 Web tables and generated
at least one hypothesis on 316, 006 tables (100%), due to the plain hypothesis. We
created a total of 35, 433, 908 hypotheses (including the plain hypotheses) with
an average of 1.12 hypotheses created per cell. 572 hypotheses are related to kg,
1, 587 to km, 6, 378, 332 are hypotheses on integers and 693, 481 are hypotheses
on floats. 839, 459 is the total number of hypotheses generated for dates, which
is split into 367, 836 for the Mon DD, YYYY form, 62, 656 for the YYYY-MM-D form,
139, 659 for the DD.MM.YYYY form, and 269, 308 for the MM.DD.YYYY form.
    Entity Linking: We measured how many entity linking hypotheses (distin-
guished between entities, classes, and properties) were created on average per
table, row, column, and cell. The results are shown in Table 3.
Table 2. Average number of entity linking hypotheses per table/row/column/cell dis-
tinguished by type (entity, class, property).

                                 entities classes properties
                         table  271.74 0.41      1.04
                         row     18.66 0.03      0.07
                         column 50.70 0.08       0.19
                         cell     3.21 0.005     0.01

    Literal Linking: The task processes only tables with at least one entity link-
ing hypothesis. For these tables, on average 0.08 literals per row were linked
and in average 0% of the literals are related to at least two different entities. At
least one hypothesis was generated for 556 tables (0.2%). We found that the top
5 properties that link literals to entities are dbp:dateOfBirth, dbp:birthDate,
dbo:percentageOfAreaWater, dbp:released, and dbo:birthDate. For all these
properties the object is either a numerical value or a date. Our current imple-
mentation might be too restrictive to match strings.


4   Related Work

In this section we discuss works related to the tasks that we described. Often, re-
lated approaches address more than one task and also address higher-level tasks,
such as [10, 11] which present an approach for joint inference using Probabilistic
Graphical Models and detect relations between columns. [15] extracts informa-
tion from Wikipedia tables to populate a knowledge base. Their approach is
based on extracting binary relations between entities.
    Table Orientation Detection: Orientation detection in [14] is based on the
intuition that if rows are similar to each other, then the orientation is vertical
and if columns are similar to each other, then orientation is horizontal. The


                                        10
authors introduce a distance metric for cells and use it to define distance metrics
for rows and columns.
    Table Classification: In [19] the authors distinguish between genuine tables
where a two dimensional grid is semantically significant in conveying the logical
relations among the cells and non-genuine tables. They define a set of features
(layout features, content type features, and word group features) and experiment
with decision tree classification and SVM. [3] propose a fine-grained taxonomy of
HTML tables that contain relational knowledge. The table types were manually
created after inspecting a set of tables and the authors analyzed the distribution
of these types by manually classifying tables via statistics features, cell features,
layout features, predicate features, and score features with which they train a
classifier. The authors of [9] classify tables into the classes Genuine Table with
Header, Genuine Table without Header, and Non-genuine Table.
    Table Segmentation: [17, 5, 13] introduce Minimum Indexing Point Search
and identify row and column headers by locating the minimum indexing point
of a table. [2] analyze Web spreadsheets and perform a CRF-based segmentation.
    Table Normalization: To detect the content of a cell, in [7] regular expres-
sions are used. In [16] data types (i.e., string, numeric values, time-stamps, and
coordinates) are detected via regular expressions. Handcrafted transformation
rules are used to transform abbreviations, e.g., Co. to Company.
    Entity Linking: [12] uses DBpedia as a knowledge base to map cell values.
They built classifiers that pick the most likely entity from top N candidates.
Similar to their work, we use DBpedia as a knowledge base to find the candidate
entities for a cell value. Authors in [1] present an approach for linking the cell
values to YAGO entities using an Iterative Classification Algorithm [8].
    Literal Linking: [15] extract relations between pairs of cells and classify them
into the relation between the two cells and the connection between the cell types
and the possible relations for those.


5   Conclusions and Future Work

In this paper we present basic tasks that enrich a large subset of tables of an
existing dataset with hypotheses based on DBpedia, we present statistics about
the data and the hypotheses and make the corpus available to the community.
We believe that this data allows others to focus on higher-level tasks of table
understanding such as column understanding and to investigate how much in-
formation extracted from Web tables could be added to DBpedia.
    Given that the WTC corpus contains 233 million tables and so far we an-
notated only one million tables, for future work we plan to annotate the entire
corpus and support more than 5 languages by creating more regular expressions
to detect and normalize values (e.g., for currencies and time measurements),
and to develop tasks that build row-based and column-based hypotheses based
on the cell-related hypotheses, such as for table orientation detection and table
segmentation. More files will be available from our website in the future.

                                         11
Acknowledgements
This work was supported by the Cluster of Excellence Cognitive Interaction
Technology ’CITEC’ (EXC 277) at Bielefeld University, which is funded by the
German Research Foundation (DFG). The work was partially created within the
Intelligent Systems Master students project Information Extraction from Web
Tables at Bielefeld University under the supervision of B. Ell and S. Hakimov.


References
 1. C. S. Bhagavatula, T. Noraset, and D. Downey. TabEL: Entity Linking in Web
    Tables. ISWC ’15, pages 425–441. Springer, 2015.
 2. Z. Chen and M. Cafarella. Automatic Web Spreadsheet Data Extraction. SSW
    ’13, pages 1:1–1:8. ACM, 2013.
 3. E. Crestan and P. Pantel. Web-scale Table Census and Classification. WSDM ’11,
    pages 545–554. ACM, 2011.
 4. J. Eberius, K. Braunschweig, M. Hentsch, M. Thiele, A. Ahmadov, and W. Lehner.
    Building the Dresden Web Table Corpus: A Classification Approach. BDC ’15.
 5. D. W. Embley, S. Seth, and G. Nagy. Transforming Web Tables to a Relational
    Database. ICPR ’14, pages 2781–2786, Aug 2014.
 6. S. Hakimov, H. t. Horst, S. Jebbara, M. Hartung, and P. Cimiano. Combining
    Textual and Graph-Based Features for Named Entity Disambiguation Using Undi-
    rected Probabilistic Graphical Models. EKAW ’16, pages 288–302. Springer, 2016.
 7. W. Holzinger, B. Krüpl, and M. Herzog. Using Ontologies for Extracting Product
    Features from Web Pages. ISWC ’06, pages 286–299. Springer, 2006.
 8. Q. Lu and L. Getoor. Link-based Classification. ICML ’03, pages 496–503, 2003.
 9. W. Lu, Z. Zhang, R. Lou, H. Dai, S. Yang, and B. Wei. Mining RDF from Tables
    in Chinese Encyclopedias. NLPCC ’15, pages 285–298, 2015.
10. V. Mulwad, T. Finin, and A. Joshi. Automatically Generating Government Linked
    Data from Tables. In AAAI Fall Symposium, volume 4, 2011.
11. V. Mulwad, T. Finin, and A. Joshi. Semantic Message Passing for Generating
    Linked Data from Tables. ISWC ’13, pages 363–378. Springer, 2013.
12. V. Mulwad, T. Finin, Z. Syed, and A. Joshi. Using linked data to interpret tables.
    COLD ’10, pages 109–120. CEUR-WS.org, 2010.
13. G. Nagy, S. Seth, and D. W. Embley. End-to-End Conversion of HTML Tables for
    Populating a Relational Database. IAPR ’14, pages 222–226, April 2014.
14. A. Pivk, Y. Sure, P. Cimiano, M. Gams, V. Rajkovic, and R. Studer. Transforming
    Arbitrary Tables into F-Logic Frames with TARTAR. DKE, 60(3):567–595, 2007.
15. C. Ran, W. Shen, J. Wang, and X. Zhu. Domain-Specific Knowledge Base Enrich-
    ment Using Wikipedia Tables. ICDM ’15, pages 349–358. IEEE, 2015.
16. D. Ritze, O. Lehmberg, and C. Bizer. Matching HTML Tables to DBpedia. WIMS
    ’15, 2015.
17. S. Seth and G. Nagy. Segmenting Tables via Indexing of Value Cells by Table
    Headers. ICDAR ’13, pages 887–891, Aug 2013.
18. J. Wang, H. Wang, Z. Wang, and K. Q. Zhu. Understanding Tables on the Web.
    ER ’12, pages 141–155. Springer, 2012.
19. Y. Wang and J. Hu. Automatic Table Detection in HTML Documents. Series in
    Machine Perception and Artificial Intelligence, 55:135–154, 2003.


                                         12