<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Disambiguating Web Tables using Partial Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ziqi Zhang</string-name>
          <email>z.zhang@dcs.shef.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Sheffield</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This work addresses disambiguating Web tables - annotating content cells with named entities and table columns with semantic type information. Contrary to state-of-the-art that builds features based on the entire table content, this work uses a method that starts by annotating table columns using automatically selected partial data (i.e., a sample), then using the type information to guide content cell disambiguation. Different sample selection methods are introduced and tested to show that they contribute to higher accuracy in cell disambiguation, comparable accuracy in column type annotation with reduced computation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Enabling machines to effectively and efficiently access the increasing amount of
tabular data on the Web remains a major challenge to the Semantic Web, as the classic
indexing, search and NLP techniques fail to address the underlying semantics carried
by tabular structures [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. This has sparked increasing interest in research on
semantic Table Interpretation, which deals with semantically annotating tabular data such as
shown in Figure 1. This work focuses specifically on annotating table columns that
contain named entity mentions with semantic type information (column classification),
and linking content cells in these columns with named entities from knowledge bases
(cell disambiguation). Existing work follows a typical workflow involving 1) retrieving
candidates (e.g., named entities, concepts) from the knowledge base, 2) constructing
features of candidates, and 3) applying inference to choose the best candidates. One
key limitation is that they adopt an exhaustive strategy to build the candidate space for
inference. In particular, annotating table columns depends on candidate entities from
all cells in the column [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. However, for human cognition this is unnecessary. For
example, one does not need to read the entire table shown in Figure 1 - which may contain
over a hundred rows - to label the three columns. Being able to make such inference
using partial (as opposed to the entire table) or sample data can improve the efficiency
of the task as the first two phases can cost up to 99% of computation time [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Sample driven Table Interpretation opens up several challenges. The first is defining
a sample with respect to each task. The second is determining the optimal size of the
sample with respect to varying sizes of tables. The third is choosing the optimal sample
entries, since a skewed sample may damage accuracy. Our previous work in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] has
proposed TableMiner to address the first two challenges. This work adapts TableMiner
to explore the third challenge. A number of sample selection techniques are introduced
and experiments show that they can further improve cell disambiguation accuracy and
in the column type annotation task, contribute to reduction in computation with
comparable learning accuracy.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <sec id="sec-2-1">
        <title>An increasing number of work has</title>
        <p>
          been carried out in semantic Table
Interpretation, such as Venetis et
al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] that uses a maximum
likelihood model, Limaye et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] that
uses a joint inference model, and
Mulwad et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] that uses joint
inference with semantic message
passing. These methods differ in
terms of the inference models,
features and background knowledge Fig. 1. Lakes in Central Greece
bases used. All these methods are, as discussed earlier, ‘exhaustive’ as they require
features built based on all content cells in order to annotate table columns. Zwicklbauer
et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is the first method that annotates a table column using a sample of the column.
However, the sample is arbitrarily chosen.
3
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>
        TableMiner is previously described in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. It disambiguates named entity columns in a
table in two phases. The first phase creates preliminary annotations by using a sample
of a column to classify the column in an iterative, incremental algorithm shown in
Algorithm 1. In each iteration, a content cell Ti;j drawn from a column Tj is disambiguated
(output Ei;j ). Then the concepts associated with the winning entity are gathered to
create a set of candidate concepts for the column, Cj . Candidate concepts are scored and
their score can change at each iteration due to newly disambiguated content cells adding
re-enforcing evidence. At the end of each iteration, Cj from the current iteration is
compared with the previous. If scores of candidate concepts are little changed (convergence,
see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for a method for detection), then column classification is considered to be stable
and the highest scoring candidates are (Cj+) chosen to annotate the column. The second
phase begins by disambiguating the remaining cells (part I), this time using the type
information for the column to limit candidate entity space to those belonging to the type
only. This may revise Cj for the column, either adding new elements, or resetting scores
of existing ones and possibly causing the winning concept for the column to change.
In this case, the next part of the second phase (part II) repeats the disambiguation and
classification operations on the entire column, while using the new Cj+ as constraints to
restrict candidate entity space. This procedure repeats until Cj+ and the winning entity
in each cell stabilizes (i.e., no change).
      </p>
      <p>
        Modified TableMiner For the purpose of this study, TableMiner is modified (T Mmod)
to contain only the first phase and part I of the second phase. In other words, we do not
revise the column classification results obtained from sample data. Therefore T Mmod
may only use a fraction of a column’s data to classify the column, which reduces
computation overhead compared to classic ‘exhaustive’ methods.
Sample selection The choice of the sam- Algorithm 1 Sample based classification
ple can affect learning in T Mmod in two
ways. While the size of the sample is 1: Input: Tj; Cj ← ∅
dealt with by the convergence measure 32:: forparlelvcCeljl T←i;jCinj Tj do
described in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], here we address the is- 4: Ei;j ←disambiguate(Ti;j )
sue of selecting the suitable sample en- 5: Cj ←updateclass(Cj, Ei;j )
tries to ensure learning accuracy. Since 6: if convergence(Cj, prevCj) then
column classification depends on the dis- 7: break
ambiguated cells in the sample, we hy- 8: end if
pothesize that high accuracy of cell dis- 9: end for
ambiguation contributes to high accuracy
in column classification. And we further hypothesize that higher accuracy of content
cell disambiguation can be achieved by 1) richer feature representation, and 2) less
ambiguous names (i.e., if a name is used by only one or very few named entities).
Therefore, we propose three methods to compute a score of each content cell in a
column, then rank them by the score before running Algorithm 1 (i.e., input Tj will contain
content cells the order of which is re-arranged based on the scores).
      </p>
      <p>One-sense-per-discourse (ospd) First and foremost, we make the hypothesis of
‘one-sense-per-discourse’ in table context, that if an NE-column is not the subject
column of the table (e.g., the first column in Figure 1 is a subject column), then cells
containing the same text content are extremely likely to express the same meaning1.
Thus to apply ospd we firstly re-arrange cells in a column by putting those containing
duplicate text content adjacent to each other. Next, when disambiguating a content cell,
the feature representation of the cell concatenates the row context of the cell, and that
of any adjacent cells with the same text content (e.g., in Table 1 we assume
‘AetoliaAcarnania’ on the three rows to have the same meaning, and build a single feature
representation by concatenating all the three rows). Effectively this creates a richer feature
representation for cells whose content re-occur across a table.</p>
      <p>Feature size (fs) With fs, we firstly apply ospd, then rank cells in a column by the
size of their feature representation as determined by the number of tokens in a
bag-ofwords representation. This would allow T Mmod to start with cells that potentially have
the largest - hence ‘richest’ - feature representation in Algorithm 1.</p>
      <p>Name length (nl) With nl, we count the number of words in the cell text content to
be disambiguated and rank cells by this number - name length (e.g., in Table 1
‘AetoliaAcarnania’ has two words and will be disambiguated before ‘Boeotia’). nl merely relies
on the name length of a cell content and does not apply ospd. The idea is that longer
names are less likely to be ambiguous.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Evaluation and Conclusion</title>
      <p>We evaluate the proposed methods of sample selection using two datasets: LimayeAll
and Limaye2002. LimayeAll contains over 6000 tables and is used for evaluating
content cell disambiguation. Limaye200 contains a subset 200 tables from LimayeAll with</p>
      <sec id="sec-4-1">
        <title>1 Due to space limitation, details are omitted but can be found in [3, 4] 2 [4], currently under transparent review.</title>
        <p>Cell disambiguation (LimayeAll) Column classification (Limaye200)
T Mmod T M mosopdd T M mfsod T Mmnlod T Mmod T M mosopdd T M mfsod T Mmnlod
0.809 0.812 0.812 0.813 0.723 0.719 0.721 0.723</p>
        <p>
          Table 1. Cell disambiguation and column classification accuracy in F1.
columns manually annotated with Freebase concepts, and used for evaluating column
classification. As a baseline, T Mmod without any sample selection techniques is used.
It simply chooses cells from a column in their original order in Algorithm 1. This is
compared against T M mosopdd, which applies ospd to non-subject NE-columns, preserves
the original order but disambiguates groups of cells containing the same text content;
T M mfsod that applies ospd to non-subject NE-columns then prioritizes cells that
potentially have richer feature representation; and T Mmnlod that prioritizes cells containing
longer text content. Results on both datasets are shown in Table 1. It suggests that,
compared against T Mmod, the sample selection techniques can enhance the accuracy of cell
disambiguation marginally. In the column classification task however, they do not add
benefits in terms of accuracy. By analyzing the computation overhead in terms of the
automatically determined sample size in each table, it shows that the sample selection
techniques have reducing the amount of data to be processed in column classification.
As an example, T Mmod converges on average after processing 58% of cells in a table
column, i.e., it manages to classify a table column using a sample size of 58% of the
total number of cells in that column. T M mosopdd reduces this to 53%, for T M mfsod 52%
and for T Mmnlod 58% (unchanged). This may contribute to noticeable reduction in CPU
time since the construction of feature space (including querying knowledge bases) for
each data unit consumes over 90% of computation time [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. To summarize, it has been
shown that, by using sample selection techniques, it is possible to semantically
annotate Web tables in a more efficient way, achieving comparable or even higher learning
accuracy depending on tasks.
        </p>
        <p>Acknowledgement: Part of this work is carried out in the LODIE project (Linked Open
Data Information Extraction), funded by EPSRC (EP/J019488/1).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Limaye</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarawagi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chakrabarti</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Annotating and searching web tables using entities, types and relationships</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>3</volume>
          (
          <issue>1-2</issue>
          ),
          <fpage>1338</fpage>
          -
          <lpage>1347</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Mulwad</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Semantic message passing for generating linked data from tables</article-title>
          .
          <source>In: International Semantic Web Conference (1)</source>
          . pp.
          <fpage>363</fpage>
          -
          <lpage>378</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Venetis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madhavan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Pas¸ca,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            ,
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Miao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          :
          <article-title>Recovering semantics of tables on the web</article-title>
          .
          <source>Proc. of VLDB Endowment</source>
          <volume>4</volume>
          (
          <issue>9</issue>
          ),
          <fpage>528</fpage>
          -
          <lpage>538</lpage>
          (
          <year>Jun 2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Start small, build complete: Effective and efficient semantic table interpretation using tableminer. In: The Semantic Web Journal (under reviewer</article-title>
          , #
          <fpage>668</fpage>
          -
          <lpage>1878</lpage>
          ) (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Towards efficient and effective semantic table interpretation</article-title>
          . In: To appear in: ISWC2014 (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Zwicklbauer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Einsiedler</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Granitzer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seifert</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Towards disambiguating web tables</article-title>
          .
          <source>In: International Semantic Web Conference (Posters &amp; Demos)</source>
          . pp.
          <fpage>205</fpage>
          -
          <lpage>208</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>