<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Disambiguating Web Tables</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefan Zwicklbauer</string-name>
          <email>stefan.zwicklbauer@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christoph Einsiedler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Granitzer</string-name>
          <email>michael.granitzer@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christin Seifert</string-name>
          <email>christin.seifert@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Passau, Innstraße 33a</institution>
          ,
          <addr-line>94032 Passau</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Web tables comprise a rich source of factual information. However, without semantic annotation of the tables' content, the information is not usable for automatic integration and search. We propose a methodology to annotate table headers with semantic type information based on the content of the column's cells. In our experiments on 50 tables we achieved an F1 value of 0.55, where the accuracy varies greatly depending on the ontology used. Moreover, we found that to reach 94% of the maximal F1 score only 20 cells (37%) need to be considered on average. The results suggest that for table disambiguation the choice of the ontology needs to be considered and that the data input size can be reduced.</p>
      </abstract>
      <kwd-group>
        <kwd>Disambiguation</kwd>
        <kwd>Semantic Enrichment</kwd>
        <kwd>Table Annotation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Tables on web sites or in scientific papers represent a valuable source of
information for the human reader. For machines the story is different: although tables
are a structural representation of knowledge, the information itself is
meaningless to machines unless it is enriched with semantic information. The Semantic
Web, and specifically the Linked Open Data initiative, provide means for
representing any kind of knowledge semantically. If tables were enriched semantically,
a variety of new applications could evolve, as is the idea of Google Fusion
Tables [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], where the annotation is done by humans. Tables from different sources
could be automatically aggregated, compared and used to generate new insights.
      </p>
      <p>
        In this paper we move one step towards fully automatic semantic table
annotation. We propose a simple algorithm to annotate table headers with semantic
types based on the types of all cells in that column. Further, we investigate the
influence of the number of cells on the accuracy of the header-type inference.
Current approaches in table annotation pursue a collective approach, i.e., the
annotation model encompasses entities, types and relations between types. The
underlying assumption is that columns and the cells in a column have some
relation in common which is modeled in the semantic knowledge base. Limaye et
al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] use a probabilistic graphical model to collectively annotate types (column
headers), entities (cells), and semantic relations between types. They achieved
an F1 score of 0.56 on the Wiki table data set. The authors observed failures
in the annotation if the corresponding correct links between entity types were
not represented in the knowledge base. A recent paper employs a web search
approach for entity classification [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], using the cell label as the query for a
web search engine and applying text classification to the search results. Venetis
et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] annotate tables with class information crawled from the web, based
on an isA database mined with regular expressions. This web-crawled database
has a wider coverage than any modeled ontological database, but suffers from
more noise. The authors observe that using a hand-crafted ontology yields a higher
precision, but a high coverage is desired for their application of table search. Our
approach differs in the following: First, we do not assume any relation between columns,
or more specifically we do not assume that these relations are modeled in the
knowledge base. Second, the only restriction we employ on the entity type is
its availability in the knowledge base.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>The general assumption behind our annotation algorithm is that the cells of a
table column belong to one supertype, which we want to infer. We make no
assumptions about interrelationships between columns, i.e., all columns are treated
separately. Further, we assume that the tables do not have merged cells.</p>
      <p>
        Let l<sub>i</sub>, 1 &lt; i ≤ n, be the labels of the non-header cells i, and let E<sub>i</sub> = {e<sub>ik</sub>} be the set
of all possible semantic meanings of label l<sub>i</sub>. The set T<sub>ik</sub> is the set of all type
labels assigned to entity e<sub>ik</sub>. The annotation of table headers is performed in
three steps (compare Figure 1):
      </p>
      <p>
        Step 1 – Cell entity annotation: For each cell label l<sub>i</sub> we derive a list of k
possible entity candidates E<sub>i</sub> using a search-based disambiguation method [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
We set k = 10 in our experiments.
      </p>
      <p>Step 2 – Entity-type resolution: For each entity candidate e<sub>ik</sub> in E<sub>i</sub>, a set of types
is retrieved by following the rdf:type and dcterms:subject relations, yielding
the set of types T<sub>ik</sub>.</p>
      <p>Step 3 – Type aggregation: The types assigned to the table header are the t types
that occur most frequently in the set of all types of all cells, ⋃<sub>i</sub> ⋃<sub>k</sub> T<sub>ik</sub>. We set
t = 1 in our experiments, i.e., we only use the most frequent type as the result.</p>
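      <p>The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function annotate_header, the two lookup callables and the toy knowledge base are hypothetical stand-ins for the search-based candidate retrieval of Step 1 and the rdf:type / dcterms:subject lookup of Step 2, and the sketch assumes that every occurrence of a type among the candidates of a cell counts as one vote.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from typing import Callable, Iterable, List, Set

# Hypothetical signatures: the paper does not specify an API for either step.
CandidateLookup = Callable[[str], List[str]]  # cell label -> ranked entity candidates
TypeLookup = Callable[[str], Set[str]]        # entity -> rdf:type / dcterms:subject labels


def annotate_header(cell_labels: Iterable[str],
                    find_candidates: CandidateLookup,
                    types_of: TypeLookup,
                    k: int = 10,
                    t: int = 1) -> List[str]:
    """Return the t most frequent types over all candidates of all cells."""
    counts: Counter = Counter()
    for label in cell_labels:
        # Step 1: keep at most k entity candidates per cell label.
        for entity in find_candidates(label)[:k]:
            # Step 2: collect the type labels of this candidate.
            counts.update(types_of(entity))
    # Step 3: the header receives the t most frequent types.
    return [type_label for type_label, _ in counts.most_common(t)]


# Toy stand-ins for the knowledge base (illustrative data, not from the paper).
CANDIDATES = {
    "Berlin": ["Berlin", "Berlin_(band)"],
    "Paris": ["Paris", "Paris_Hilton"],
    "Rome": ["Rome"],
}
TYPES = {
    "Berlin": {"City", "Capitals_in_Europe"},
    "Berlin_(band)": {"Band"},
    "Paris": {"City", "Capitals_in_Europe"},
    "Paris_Hilton": {"Person"},
    "Rome": {"City"},
}

header_types = annotate_header(
    ["Berlin", "Paris", "Rome"],
    find_candidates=lambda label: CANDIDATES.get(label, []),
    types_of=lambda entity: TYPES.get(entity, set()),
)
print(header_types)  # ['City']
```
]]></preformat>
      <p>In this toy column, "City" is supported by three cells and therefore outvotes the types of the spurious candidates. The sketch does not define how ties between equally frequent types are broken; the text above leaves this open as well.</p>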
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>
        Data Set. In our experiments we used DBpedia as the knowledge base, with the type
relations rdf:type and dcterms:subject, the latter from the Dublin Core Metadata
Ontology (http://dublincore.org/documents/dcmi-terms/). We evaluated our approach on 50 tables extracted from Wikipedia pages,
including all tables mentioned in Limaye et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We removed all columns
containing numbers or complete sentences. The number of columns amounts to 132,
the number of rows varies from 10 to 232 (average 54.1). We manually annotated
the tables' columns, yielding a total of 329 type annotations (169 rdf:type and
160 dcterms:subject), averaging 2.49 annotations per column header.
      </p>
      <sec id="sec-3-3">
        <p>
          Overall Performance. We assessed the overall performance on the complete data
set. Table 1 shows the results for three different type vocabularies (using
RdfType relations only, using DublinCore subjects only, or using both). In terms of
precision, the combined vocabulary performs best (0.64), though only slightly
better than using DublinCore subjects only (0.59), whereas RdfType annotations
perform worst (0.24). For the combined approach, F1 is low due to the low recall:
the ground truth contains several correct types per column, but we consider
only the best result in the evaluation.
        </p>
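        <p>The effect of predicting a single type against several ground-truth types can be made concrete with a small sketch. It assumes micro-averaged, set-based precision and recall over all columns; the averaging scheme is not stated in the text, so the function micro_prf and the toy data are illustrative assumptions only.</p>
        <preformat><![CDATA[
```python
from typing import Dict, List, Set, Tuple


def micro_prf(predicted: Dict[str, List[str]],
              gold: Dict[str, Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over all columns."""
    n_correct = sum(len(set(predicted.get(col, [])).intersection(gold[col]))
                    for col in gold)
    n_predicted = sum(len(set(predicted.get(col, []))) for col in gold)
    n_gold = sum(len(types) for types in gold.values())
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative toy data: two columns, several gold types each, t = 1 prediction.
gold = {
    "col_a": {"City", "Capitals_in_Europe", "PopulatedPlace"},
    "col_b": {"Person", "Living_people"},
}
predicted = {"col_a": ["City"], "col_b": ["Person"]}

precision, recall, f1 = micro_prf(predicted, gold)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 1.0 0.4 0.57
```
]]></preformat>
        <p>Both predictions are correct, yet recall is only 0.4 because three of the five gold types can never be returned with t = 1. Under the same micro-averaged reading, one prediction per column against 329 annotations on 132 columns would cap recall at 132/329 ≈ 0.40 for the combined vocabulary.</p>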
        <p>Table Length. In a second experiment we assessed the influence of the number
of cells on the accuracy of table header disambiguation. From all 132 columns
we randomly selected k cells for the cell-entity annotation step and assessed
the header disambiguation accuracy using the DublinCore vocabulary. We
repeated the experiment 10 times with different randomly selected cells for each
k ∈ {1, 2, …, 7, 8, 10, 12, 15, 20}. Figure 2 shows precision, recall and F1 measure
averaged over all runs. As expected, for small numbers of cells the performance
increases significantly when adding one more cell (e.g., from 3 to 4 cells the F1
measure increases from 0.27 to 0.35, a growth of 26%). For larger numbers of
cells there is less information gain from adding one more cell, resulting in smaller
increases in performance (all below 10%). Using 20 cells results in an F1 of 0.514,
which is 94% of the F1 achieved with all cells (0.547).
The data set is available at https://github.com/quhfus/table-disambiguation.</p>
        <p>[Figure 2: precision, recall and F1 measure (y-axis: performance) for varying numbers of cells k.]</p>
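        <p>The sampling protocol of this experiment can be sketched as follows. The function subsample_scores, its score callback and the fixed seed are assumptions made for illustration; in particular, the text does not say how columns with fewer than k cells were handled, so the sketch simply uses all cells of such a column.</p>
        <preformat><![CDATA[
```python
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence

CELL_COUNTS = [1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 20]


def subsample_scores(columns: Dict[str, Sequence[str]],
                     score: Callable[[Dict[str, List[str]]], float],
                     cell_counts: Sequence[int] = CELL_COUNTS,
                     runs: int = 10,
                     seed: int = 0) -> Dict[int, float]:
    """Average a score over `runs` random k-cell samples of every column."""
    rng = random.Random(seed)
    averaged = {}
    for k in cell_counts:
        run_scores = []
        for _ in range(runs):
            # Draw k cells without replacement (all cells if the column is shorter).
            sample = {name: rng.sample(list(cells), min(k, len(cells)))
                      for name, cells in columns.items()}
            run_scores.append(score(sample))
        averaged[k] = mean(run_scores)
    return averaged


# Toy usage: the "score" is just the mean sample size, to show the protocol.
columns = {
    "a": [f"a{i}" for i in range(30)],
    "b": [f"b{i}" for i in range(5)],
}
averaged = subsample_scores(
    columns, score=lambda sample: mean(len(cells) for cells in sample.values()))
print(averaged[1], averaged[20])  # 1 12.5

# Relative F1 reported in the text: 20 cells versus all cells.
print(round(0.514 / 0.547, 2))  # 0.94
```
]]></preformat>
        <p>In an actual run, the score callback would annotate each sampled column and return precision, recall or F1 against the ground truth.</p>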
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>We proposed an algorithm to annotate table headers with semantics based on the
types of the column's cells. We achieved an accuracy similar to that of previous work
that uses more complex methods. We expect the reason for the comparable performance
to be the knowledge base, which has more exhaustive and higher-quality annotations.
From our experiments it seems reasonable to use only a small number of cells
for annotating the header (20 cells lead to 94% of the total achievable accuracy)
if runtime performance is an issue. We plan to exploit more relational knowledge (e.g.,
same-as) to further improve the annotations.</p>
      <p>Acknowledgements. The presented work was developed within the CODE project
funded by the EU Seventh Framework Programme, grant agreement number 296150.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Gonzalez</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jensen</surname>
            ,
            <given-names>C.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Langen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madhavan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shapley</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldberg-Kidon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Google fusion tables: web-centered data management and collaboration</article-title>
          .
          <source>In: Proc. ACM SIGMOD</source>
          , New York, ACM (
          <year>2010</year>
          )
          <fpage>1061</fpage>
          –
          <lpage>1066</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Limaye</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarawagi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chakrabarti</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Annotating and searching web tables using entities, types and relationships</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>3</volume>
          (
          <issue>1</issue>
          -2) (
          <year>September 2010</year>
          )
          <fpage>1338</fpage>
          –
          <lpage>1347</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Quercini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reynaud</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Entity discovery and annotation in tables</article-title>
          .
          <source>In: Proc. EDBT</source>
          , New York, NY, USA, ACM (
          <year>2013</year>
          )
          <fpage>693</fpage>
          –
          <lpage>704</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Venetis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madhavan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pasca</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miao</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Recovering semantics of tables on the web</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>4</volume>
          (
          <issue>9</issue>
          ) (6
          <year>2011</year>
          )
          <fpage>528</fpage>
          –
          <lpage>538</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Zwicklbauer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seifert</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Granitzer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Do we need entity-centric knowledge bases for entity disambiguation?</article-title>
          <source>In: Proc. I-KNOW</source>
          . (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>