<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using linked data to interpret tables</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Varish Mulwad</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tim Finin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zareen Syed</string-name>
          <email>zareensyed@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anupam Joshi</string-name>
          <email>joshig@cs.umbc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Electrical Engineering University of Maryland</institution>
          ,
          <addr-line>Baltimore County, Baltimore, MD USA 21250</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Vast amounts of information are available in structured forms like spreadsheets, database relations, and tables found in documents and on the Web. We describe an approach that uses linked data to interpret such tables and to associate their components with nodes in a reference linked data collection. Our proposed framework assigns a class (i.e., a type) to table columns, links table cells to entities, and maps relations inferred between columns to properties. The resulting interpretation can be used to annotate tables, confirm existing facts in the linked data collection, and propose new facts to be added. Our implemented prototype uses DBpedia as the linked data collection and Wikitology for background knowledge. We evaluated its performance using a collection of tables from Google Squared, Wikipedia, and the Web.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Web</kwd>
        <kwd>linked data</kwd>
        <kwd>human language technology</kwd>
        <kwd>entity linking</kwd>
        <kwd>information retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Resources like Wikipedia and the Semantic Web's linked open data collection [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] are now being integrated to provide experimental knowledge bases containing both general-purpose knowledge and a host of specific facts about significant people, places, organizations, events, and many other entities of interest. The results are finding immediate applications in many areas, including improving information retrieval, text mining, and information extraction. Still more structured data is being extracted from text found on the web through several new research programs [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
      <p>
        We describe a prototype system that automatically interprets and extracts information from tables found on the web. The system interprets such tables using common linked data knowledge bases, in our case DBpedia, and Wikitology [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a custom hybrid knowledge base, for background knowledge. To develop an overall interpretation of a table, we assign a class to every table column and link every table cell to an entity from the LOD cloud. We also present preliminary work on identifying relations between table columns. This interpretation can be used for a variety of tasks; in this paper we describe the task of annotating web tables for the Semantic Web. We describe a template used to publish the annotations as N3. Our implemented prototype was evaluated using a collection of tables from Google Squared, Wikipedia, and the Web.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Motivation and Related Work</title>
      <p>
        While the availability of data on the Semantic Web has been growing slowly, the Web itself continues to grow at a rapid pace. In July 2008, Google announced that it had indexed one trillion unique documents on the web1, and much of the data on the Web is stored in HTML tables. Cafarella et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] estimated that there are around 14.1 billion HTML tables, of which 154 million contain high-quality relational data.
      </p>
      <p>As part of the Linked Open Data initiative, the US, UK, and various other governments have shared publicly available government data in tabular form. This represents a huge source of knowledge currently unavailable on the Semantic Web. There is a need for systems that can automatically generate data in formats suitable for the Semantic Web from existing sources, be they unstructured (e.g., free text), semi-structured (e.g., text embedded in forms or wikis), or structured (e.g., data in spreadsheets and databases).</p>
      <p>
        Extracting and representing tabular data as RDF is not a new problem. Significant research has been done in the area of mapping relational databases to RDF, and various manual and semi-automatic approaches have been proposed (see [6-10]). To standardize the mapping language for mapping relational databases to RDF and OWL, the W3C has formed the RDB2RDF working group2. On June 8, 2010 the group published its first working draft, capturing use cases and requirements for mapping relational databases to RDF [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        The other research focus has been on mapping spreadsheets to RDF (see [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]). While existing systems like RDF123 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] are practical and useful for extracting knowledge from tables, they suffer from several shortcomings. Such systems require human intervention, e.g., requiring users to choose the classes and properties from appropriate ontologies to be used in the annotations. These systems do not offer any automated (or semi-automated) mechanisms for associating the column headers with known classes or linking the entities in spreadsheets to known entities from the linked data cloud.
      </p>
      <p>While these systems generate triples, since the columns and entities are not linked, the triples are of little use to other applications that want to exploit the data. In such cases the triplified data is scarcely more useful on the Semantic Web than the raw data would have been.</p>
      <p>
        While Han et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] focused on the problem of associating possible types with column headers, they did not aim at a complete interpretation of the table, nor at integrating the table with the linked open data cloud. Limaye et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] perform the same sub-tasks as we do; however, their goal is answering search queries over web tables. The focus of this paper is an automatic framework for interpreting tables using existing linked data knowledge, and using the interpretation to generate linked RDF annotations of web tables for the Semantic Web.
1 http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
2 http://www.w3.org/2001/sw/rdb2rdf/
      </p>
    </sec>
    <sec id="sec-3">
      <title>Interpreting a table</title>
      <p>Fig. 1. In this simple table about basketball players, the column header represents the type of data stored in the column; the values in the columns represent instances of that type.
Name | Team | Position
Michael Jordan | Chicago | Shooting guard
Allen Iverson | Philadelphia | Point guard
Yao Ming | Houston | Center
Tim Duncan | San Antonio | Power forward</p>
      <p>
        Consider the table shown in Figure 1. The column headers suggest the type of information in the columns: Name and Team might match classes in a target ontology such as DBpedia [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]; Position could match properties in the same or related ontologies. Examining the data values, which are initially just strings, provides additional information that can confirm some possibilities and disambiguate between possibilities for others. The strings in column one can be recognized as entity mentions that are instances of the dbpedia-owl:Person class and can be linked to known entities in the LOD. Additional analysis can automatically generate a narrower description, such as dbpedia-owl:BasketballPlayer.
      </p>
      <p>However, just examining the string values in a column may not be enough. Consider the strings in column two. An initial examination of just the strings would suggest that they are instances of the dbpedia-owl:PopulatedPlace class and that the strings should be linked to the respective cities. In this case, that analysis would be wrong, since the strings refer to NBA basketball teams.</p>
      <p>Thus it is important to consider additional contextual evidence, provided by the column header and the rest of the row values. In this example, given the evidence that the values in column one are basketball players and the values in column three are their playing positions, we can correctly infer that the values in column two are basketball teams and not cities in the United States.</p>
      <p>Identifying relations between columns is important as well, since relations can help identify the columns that can be mapped as properties of some other column in the table. For example, the values in column three are values of the property dbpedia-owl:position, which can be associated with the players in column one.</p>
      <p>Producing an overall interpretation of a table is a complex task that requires developing an overall understanding of the intended meaning of the table as well as attention to the details of choosing the right URIs to represent both the schema and the instances. We break the process down into the following tasks:
- assign every column a class label from an appropriate ontology
- link table cell values to appropriate LD entities, if possible
- discover relationships between the table columns and link them to linked data properties</p>
      <p>In this paper we focus on the first two tasks - associating a type/class label with a column header and linking table cells to entities. We also present preliminary work on discovering relations between columns, and we show how the resulting interpretation can be used to annotate webtables. The details of our approach and its prototype implementation are described in Section 4, and the results of the evaluation are described in Section 5.</p>
    </sec>
    <sec id="sec-4">
      <title>Approach</title>
      <p>Our approach comprises four steps: associating ontology classes with columns, linking cell values to instances of those classes, discovering implicit relations between columns in the table, and generating annotation output. We discuss each step in turn.</p>
      <p>4.1 Associating Classes with Columns
In a typical well-formed table, each column contains data of a single syntactic type (e.g., strings) that represent entities or values of a common semantic type (e.g., people, or yearly salary in US dollars). The column's header, if present, may name or describe the semantic type, or perhaps a relation in which the column participates. Our initial goal is to predict the semantic class from among the possible classes in our linked data collection that best characterizes the column's values. Our approach is to map each cell value to a ranked list of classes and then to select the one that best characterizes the entire column. Algorithm 1 describes the steps involved in the process.</p>
      <p>
        The algorithm first determines the type or class of each string in the column by submitting a complex query to the Wikitology KB. The KB returns a ranked list of the top N instances for each string in the column, together with their classes. From the classes of the instances returned by the KB, a set of possible class labels for the column is generated. Each class label in this set is assigned a score based on the weighted scoring technique described in Algorithm 1. The class labels in the set are paired with the strings in the column, and each pair is scored. The score is based on the highest-ranked instance of the string that matches the class being scored, and is a weighted sum of the instance's Wikitology rank for the query and its approximate PageRank [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The class label that maximizes its score over the entire column is chosen as the class label to be associated with the column. We predict class labels from four vocabularies - the DBpedia Ontology, Freebase [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], WordNet [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], and Yago [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>In the following sections we describe our knowledge base and the custom query module used to query it.
Algorithm 1 "PredictClassLabel" - an algorithm to pick the best class to be associated with a column
1: Let S be the set of k strings in a table column.
2: For each s in S, query the Wikitology KB to get a ranked list of the top N possible Wikipedia instances, along with their types (class labels) and their predicted PageRanks.
3: From the k x N instances, generate the set of class labels that can be associated with the column. Let C be the set of all associated classes for the column.
4: Create a matrix V[ci, sj] of class label-string pairings, where 0 &lt; i ≤ |C| and 0 &lt; j ≤ |S|.
5: Assign a score to each V[ci, sj] based on the highest-ranking instance that matches ci. The instance's rank R and its predicted PageRank are used to assign a weighted score to V[ci, sj] (we use w = 0.25):</p>
      <p>Score = w × (1 / R) + (1 - w) × PageRank
6: If none of the instances for a string match the class label being evaluated, assign the pair V[ci, sj] a score of 0.
7: Choose the class label ci that maximizes its score over the entire column S to be associated with the column.</p>
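      <p>The weighted scoring step of Algorithm 1 can be sketched as follows. This is a minimal illustration over hypothetical candidate lists (the ranks and normalized PageRanks are made up), not the output of a live Wikitology index:</p>
      <p>
```python
# Sketch of Algorithm 1's scoring. Each (class, string) pair is scored by the
# highest-ranked matching instance; the class whose scores sum highest over
# the whole column wins.

W = 0.25  # weight on the Wikitology rank; the paper uses w = 0.25

def pair_score(rank, pagerank, w=W):
    # Score = w * (1 / R) + (1 - w) * PageRank
    return w * (1.0 / rank) + (1 - w) * pagerank

def predict_class_label(column_results):
    # column_results maps each cell string to a ranked list of
    # (class_label, rank, pagerank) candidate instances
    totals = {}
    for cell, candidates in column_results.items():
        per_class = {}
        for cls, rank, pagerank in candidates:
            # keep the score of the highest-scoring instance matching cls
            s = pair_score(rank, pagerank)
            if s > per_class.get(cls, 0.0):
                per_class[cls] = s
        for cls, s in per_class.items():
            totals[cls] = totals.get(cls, 0.0) + s
    # the class whose summed score over the entire column is maximal
    return max(totals, key=totals.get)

# hypothetical candidate lists for one column (not real Wikitology output)
column = {
    "Michael Jordan": [("BasketballPlayer", 1, 0.9), ("Person", 2, 0.8)],
    "Yao Ming": [("BasketballPlayer", 1, 0.7), ("Person", 2, 0.6)],
    "Allen Iverson": [("Person", 1, 0.5), ("BasketballPlayer", 2, 0.85)],
}
```
      </p>
      <p>On this toy column, BasketballPlayer accumulates a higher total than the broader Person class, matching the intuition behind step 7 of the algorithm.</p>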
      <p>Input: Table Cell Value (String)</p>
      <p>Table Row Data (RowData)</p>
      <p>
        Table Column Header (ColumnHeader)
Output: Top N matching instances from the KB (TopN)
Query = wikiTitle: String (or)
redirects: String (or)
firstSentence: String, ColumnHeader (or)
types: ColumnHeader (or)
categories: ColumnHeader (or)
contents: (String)^4.0, RowData (or)
linkedConcepts: (String)^4.0, RowData (or)
propertiesValues: RowData
Knowledge Base. We use DBpedia as our linked data knowledge base. We also use Wikitology, a hybrid KB of structured and unstructured information extracted from Wikipedia, augmented with structured information from DBpedia, Freebase, WordNet, and Yago. The interface to Wikitology is a specialized information retrieval index, implemented using Lucene [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], that allows unstructured, structured, and hybrid queries. Given this simple yet powerful query mechanism, and the fact that the backbone of Wikitology is Wikipedia, one of the most comprehensive collaborative encyclopedias, we think Wikitology and DBpedia are appropriate choices of KBs. The approaches we describe here and below are KB independent, allowing the use of any appropriate linked data knowledge bases as needed.
Mapping a table to Wikipedia. The table cell string being queried, along with its row data and column header, is mapped to the various fields of the Wikitology index. The cell string is mapped to the title, redirects, and first-sentence fields of the index. If there is a difference in spelling between the string in question and the title, or a pseudonym is used, the string may appear in the redirects. The column header is mapped to the first-sentence, types, and categories fields, since the column header of a table often describes the type of the instances present in the column, and the type is likely to appear in the first sentence as well.
      </p>
      <p>The string (with a Lucene query weight boost of 4.0), along with the row data, is mapped to the contents field as well as the linkedConcepts field. In a table, the data present in a given row are likely to have some relation amongst themselves, so we map the row data to the linkedConcepts field, which captures the concepts (articles) linked to a given Wikipedia concept (article). The row data (excluding the string) is mapped to the propertiesValues field, since the row data can be values of properties associated with the string in question. Figure 2 describes the query. All the fields are "or'ed" with each other. The query returns the top N instances that the string in the query could be associated with, along with their types, page lengths and their approximate PageRanks.</p>
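      <p>The field mapping of Figure 2 can be sketched as a small query builder. The field names follow the paper; the exact Lucene syntax accepted by the Wikitology index is an assumption here:</p>
      <p>
```python
# Sketch of assembling the hybrid query of Figure 2: the cell string goes to
# the title/redirects/first-sentence fields, the column header to the
# type-related fields, and the row data (with the boosted string) to the
# contents and linkedConcepts fields.

def build_query(cell_string, column_header, row_data):
    # row_data: the other cell values of the row, joined into one string
    clauses = [
        'wikiTitle:"%s"' % cell_string,
        'redirects:"%s"' % cell_string,
        'firstSentence:"%s %s"' % (cell_string, column_header),
        'types:"%s"' % column_header,
        'categories:"%s"' % column_header,
        'contents:("%s"^4.0 %s)' % (cell_string, row_data),
        'linkedConcepts:("%s"^4.0 %s)' % (cell_string, row_data),
        'propertiesValues:"%s"' % row_data,
    ]
    return " OR ".join(clauses)  # all fields are or'ed with each other

query = build_query("Michael Jordan", "Name", "Chicago Shooting guard")
```
      </p>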
      <p>Augmenting types from DBpedia. For every instance returned by Wikitology, we also query DBpedia, using its public SPARQL endpoint3, to fetch the types associated with that instance in DBpedia. The types returned by Wikitology are augmented with the types returned by DBpedia to get a complete and accurate set of types for the instance.</p>
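      <p>The type-augmentation lookup amounts to a one-pattern SPARQL query. A sketch follows; the endpoint URL is DBpedia's real public endpoint, but issuing the HTTP request is omitted, and we assume the endpoint's predefined prefixes (dbr:, rdf:):</p>
      <p>
```python
# Sketch of the query sent to DBpedia's SPARQL endpoint to fetch the types
# of an instance. Only the query string is built here; executing it (e.g.,
# with SPARQLWrapper or urllib) is left out.

SPARQL_ENDPOINT = "http://dbpedia.org/sparql"

def types_query(resource_name):
    # resource_name is the local name of a DBpedia resource,
    # e.g. "Michael_Jordan" for dbpedia:Michael_Jordan
    return ("SELECT DISTINCT ?type WHERE { dbr:%s rdf:type ?type }"
            % resource_name)

q = types_query("Michael_Jordan")
```
      </p>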
      <p>4.2 Linking Table Cells to Entities
We have developed an algorithm, "LinkTableCells" (see Algorithm 2), to link table cell strings to entities from the Linked Open Data cloud. For every string in the table, the algorithm re-queries the KB, using the predicted class labels for the column to which the string belongs as additional evidence. The predicted class labels are mapped to the typesRef field of the Wikitology index and "and'ed" into the query in Figure 2, thus restricting the entities returned by the KB to the predicted types (class labels).</p>
      <p>
        For each of the top N entities returned by the KB, a feature vector is generated. The feature vector consists of the entity's index score, the entity's Wikipedia page length, and the entity's PageRank (all popularity measures), together with the Levenshtein distance [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] between the entity and the string in the query and the Dice score [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] between the entity and the string (similarity measures). The Levenshtein distance and the Dice score are calculated between the query string and all labels (all possible names) of the entity. To obtain the other possible names, we query DBpedia to get the values associated with the rdfs:label property of the entity. The best Levenshtein distance (i.e., the smallest) and the best Dice score (i.e., the largest) are selected as part of the feature vector.
      </p>
      <sec id="sec-4-1">
        <title>3 http://dbpedia.org/sparql</title>
        <p>Algorithm 2 "LinkTableCells" - an algorithm to link table cells to entities
1: Let S be the set of strings in a table.
2: for all s in S do
3: Query the KB and get the top N instances that the string can be linked to. Let I be this set of instances for the string.
4: for all i in I do
5: Get all the other names associated with i. Let this set be O.
6: Calculate the Levenshtein distance between s and all o ∈ O.
7: Choose the best (smallest) Levenshtein distance between s and any o ∈ O.
8: Similarly, calculate the Dice score between s and all o ∈ O.
9: Choose the best (largest) Dice score.
10: Create a feature vector for i. The vector includes the following features: i's Wikitology index score, i's PageRank, i's page length, the best Levenshtein distance and the best Dice score.
11: end for
12: Input the feature vectors of all i ∈ I to an SVM-rank classifier. The classifier outputs a ranked list of the instances in I.
13: Select the top-ranked instance. Let it be topi.
14: To the feature vector of topi, append two new features - the SVM-rank score of topi and the difference between the scores of the top two instances ranked by SVM-rank.
15: Input this vector to another classifier, which produces a label "yes" or "no" for the given vector.
16: If the classifier labels the feature vector "yes", link the string s to the instance topi; else link it to NIL.
17: end for
We choose popularity measures as part of the feature vector because, in cases where it is difficult to disambiguate between entities, the most popular entity is often the correct answer; we choose similarity measures because the entity that gets linked to the query string should be similar, if not identical, to it as a string.</p>
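        <p>The two similarity features of Algorithm 2 can be sketched directly. We take the citations to refer to the standard dynamic-programming edit distance and a character-bigram Dice coefficient; the bigram granularity is our assumption:</p>
        <p>
```python
# Sketch of the similarity features in steps 6-9 of Algorithm 2: Levenshtein
# edit distance and a character-bigram Dice coefficient over entity labels.

def levenshtein(a, b):
    # classic dynamic-programming edit distance, row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def dice(a, b):
    # Dice coefficient over sets of character bigrams
    A = {a[i:i + 2] for i in range(len(a) - 1)}
    B = {b[i:i + 2] for i in range(len(b) - 1)}
    if not A and not B:
        return 1.0
    return 2.0 * len(A.intersection(B)) / (len(A) + len(B))

def best_features(query, labels):
    # smallest edit distance and largest Dice score over all entity labels
    return (min(levenshtein(query, l) for l in labels),
            max(dice(query, l) for l in labels))
```
        </p>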
        <p>
          For each query string, a set of feature vectors is generated from the top N instances returned as query results. A classifier built using SVM-rank [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] ranks the entities based on this feature vector set. A second classifier is trained to decide whether the evidence is strong enough to link to the top-ranked entity. This classifier decides based on the feature vector of the top-ranked entity, which now includes two additional features - the SVM-rank score of the entity and the difference between the scores of the top two entities ranked by SVM-rank. If the evidence is strong enough, the classifier suggests linking to the top-ranked entity; otherwise it suggests linking to "NIL". The process is repeated for all the strings in the table.
        </p>
        <p>The second SVM classifier was trained using Weka. In cases where linking to the top-ranked entity returned by the SVM-rank based classifier would be incorrect, for example if the entity is not present in the KB, the second classifier is useful for deciding whether to link the query string to that entity or to predict a link to "NIL". This step is useful for discovering new entities in a given table.
@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix dbpedia: &lt;http://dbpedia.org/resource/&gt; .
@prefix dbpedia-owl: &lt;http://dbpedia.org/ontology/&gt; .
@prefix yago: &lt;http://dbpedia.org/class/yago/&gt; .
"Name"@en is rdfs:label of dbpedia-owl:BasketballPlayer .
"Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .
"Michael Jordan"@en is rdfs:label of dbpedia:Michael_Jordan .
dbpedia:Michael_Jordan a dbpedia-owl:BasketballPlayer .
"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago_Bulls .</p>
        <p>dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams .
4.3 Identifying relations between columns
We have developed a preliminary approach for identifying relations between table columns. The algorithm generates a set of candidate relations from the relations that exist between the concepts associated with the strings in each row of the two columns. To identify the relations between a pair of strings, we query DBpedia using its public SPARQL endpoint.</p>
        <p>Each candidate relation is scored as follows: each pair of strings in the two columns votes for a candidate relation with a score of 1 if the candidate relation appears in the set of relations between that pair of strings. The summed score of each candidate relation is normalized by the number of rows in the table. The relation with the highest score is selected to represent the relation between the two columns. Our work on relation identification is preliminary, and we still have to develop an approach for identifying columns that can be mapped as properties of some other column.</p>
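        <p>The row-voting scheme just described can be sketched as follows. The per-row relation sets would come from DBpedia SPARQL queries; here they are stubbed with hypothetical data:</p>
        <p>
```python
# Sketch of scoring candidate relations between two columns: every candidate
# gets one vote per row in which it appears among the relations holding
# between that row's pair of entities, normalized by the number of rows.

def best_relation(row_relation_sets):
    # row_relation_sets: one set of relation names per table row
    n_rows = len(row_relation_sets)
    candidates = set().union(*row_relation_sets)
    scores = {
        rel: sum(1 for rels in row_relation_sets if rel in rels) / n_rows
        for rel in candidates
    }
    return max(scores, key=scores.get), scores

# hypothetical relation sets for three rows of a (player, team) column pair
rows = [
    {"dbpedia-owl:team", "dbpedia-owl:birthPlace"},
    {"dbpedia-owl:team"},
    {"dbpedia-owl:team", "dbpedia-owl:city"},
]
```
        </p>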
        <p>4.4 Annotating the webtable
In the previous sections (4.1, 4.2, 4.3) we described approaches that help us develop an overall interpretation of a table. We now describe how this interpretation can be used to annotate a webtable. We have developed a template for annotating and representing tables as linked RDF. We chose N3 because it is compact as well as human readable. Figure 3 shows an example of an N3 representation of a webtable. To associate a column header with its predicted class label, the rdfs:label property from RDF Schema is used. The rdfs:label property is also used to associate a table cell string with its associated entity from DBpedia. To associate the table string with its type (i.e., the class label of the column header), the rdf:type property is used.
Fig. 4(a). # of Tables: 15; # of Rows: 199; # of Columns: 56 (52); # of Entities: 639 (611)</p>
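        <p>Emitting the N3 annotations of Figure 3 from an interpretation amounts to template filling. The property choices (rdfs:label, rdf:type) follow the paper; the helper names are ours:</p>
        <p>
```python
# Sketch of generating N3 annotations from a table interpretation: column
# headers get rdfs:label links to their predicted classes, and cell strings
# get an rdfs:label plus an rdf:type ("a") statement.

def annotate_column(header, class_uri):
    return '"%s"@en is rdfs:label of %s .' % (header, class_uri)

def annotate_cell(cell, entity_uri, class_uri):
    return ['"%s"@en is rdfs:label of %s .' % (cell, entity_uri),
            '%s a %s .' % (entity_uri, class_uri)]

lines = [annotate_column("Name", "dbpedia-owl:BasketballPlayer")]
lines += annotate_cell("Michael Jordan", "dbpedia:Michael_Jordan",
                       "dbpedia-owl:BasketballPlayer")
```
        </p>
        <p>The "is rdfs:label of" form is N3's inverse-property syntax, which keeps each statement readable with the literal first, as in Figure 3.</p>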
        <p>Fig. 4(b). Distribution by category:
Category | # of Columns (%) | # of Entities (%)
Place | 40 | 45
Person | 25 | 20
Organization | 12 | 10
Other types | 23 | 25</p>
        <p>
          5 Evaluation
Our implemented prototype was evaluated against 15 tables obtained from Google Squared, Wikipedia, and a collection of tables extracted from the Web4. We consider simple regular tables with column headers, where the number of cells is equal to the product of the number of rows and columns; we do not consider tables that have been used for formatting. The task of assigning a class label to a column header was evaluated against 52 columns; linking table cells to entities, against 611 entities. The distribution of the columns and entities across the four categories - Persons, Places, Organizations and Other (movies, songs, nationality, etc.) - is shown in Figure 4(b).
5.1 Assigning class labels to columns
We used human judgments to evaluate the correctness of the class labels predicted by our approach. We evaluated the class labels predicted from the DBpedia ontology, since it was fairly easy for our evaluators to browse the DBpedia ontology. In the first evaluation of the algorithm for assigning class labels to columns, we compared the ranked list of possible class labels generated by the system against the list of possible class labels ranked by the evaluators. As shown in Figure 5, for 80.76% of the columns the Mean Average Precision (MAP) [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] between the system's and evaluators' lists is greater than 0, which indicates that there was at least one relevant label in the top three of the system-ranked list. For 75% of the columns, the recall of the algorithm was greater than or equal to 0.6. A high recall value shows that there is a high match between the labels in the top three of the system's list as compared to the top three of the evaluators' lists5.
4 The set of tables is available online at www.cs.umbc.edu/~varish1/t2ld-tables/
5 While the top three from the system's and evaluators' lists were compared, the total length of the lists varied between 5 and 11, depending on the set of possible class labels for each column.
        </p>
        <p>In the final evaluation, we assessed whether our predicted class labels were reasonable in the judgment of human evaluators. Even though a more accurate class label might exist for a given column, the evaluators needed to determine whether the predicted class was reasonable. For example, for a column of cities, a human might judge dbpedia-owl:City as the most appropriate class, consider dbpedia-owl:PopulatedPlace and dbpedia-owl:Place acceptable, and consider other classes (e.g., dbpedia-owl:AdministrativeRegion, owl:Thing) unacceptable. 76.92% of the predicted class labels were considered correct by the evaluators. The accuracy in each of the four categories is shown in Figure 6. We had only moderate success in assigning class labels for Organizations and Other types of data, probably because of the sparseness of data in the KB about these types of entities.</p>
        <p>5.2 Linking table cells to entities
For the evaluation of linking table cells to entities, we manually labeled the 611 table cells with their appropriate Wikipedia/DBpedia pages. The system-generated links were compared against the expected links: 66.12% of the table cell strings were correctly linked. A breakdown of accuracy by category (Figure 6) shows that we had the highest accuracy in linking Persons (83.05%), followed by Places (80.43%). We had moderate success in linking Organizations (61.90%), but we fare poorly on other types of data, such as movies, nationalities, songs, and types of business and industry, with an accuracy of just 29.22%, probably because of the sparseness of data in the KB about these types of entities.</p>
        <p>
          Our dataset had 24 entities that were unknown to the KB, and in all 24 cases the system correctly predicted that the table cell should be linked to "NIL". Comparing the entity linking results against our previous work, in which we used a heuristic-based method for linking table cells to entities [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], we have improved our accuracy in linking places by a good margin (18.79%), while the accuracy for linking persons and organizations decreased slightly (by 7.71% and 4.77%, respectively). However, this is not an entirely fair comparison, since the initial results from the previous method are for a small subset6 of the current data set.
        </p>
        <p>5.3 Relation identification
We did a preliminary evaluation of the identification of relations between columns. We asked human evaluators to identify pairs of columns in a table between which a relation may exist and compared these against the pairs of columns identified by the system. For the five tables used in this evaluation, the system was able to identify the correct pairs of columns in 25% of the cases.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We presented an automated framework for interpreting data in a table using existing Linked Data KBs. Using the interpretation of a table, we generate linked RDF from webtables. Our evaluations show that we have been fairly successful in generating correct interpretations of webtables. Our current work is focused on improving relationship discovery and on generating new facts and knowledge from tables that contain entities not present in the LOD knowledge bases. To handle web-scale analytics, we plan to adapt our algorithms for parallelization using Hadoop- or Azure-type frameworks. We are also exploring ways to apply this work to create an automated (or semi-automated, human-in-the-loop) framework for interpreting and representing public government datasets as linked data.</p>
      <sec id="sec-5-1">
        <title>6 The subset had 171 entities to link</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>The emerging web of linked data</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          <volume>24</volume>
          (
          <year>2009</year>
          )
          <fpage>87</fpage>
          -
          <lpage>92</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Etzioni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cafarella</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>Machine reading</article-title>
          .
          <source>In: Proceedings of the National Conference on Artificial Intelligence</source>
          . Volume
          <volume>21</volume>
          ., Menlo Park, CA, AAAI Press / Cambridge, MA, MIT Press (
          <year>2006</year>
          )
          <fpage>1517</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>McNamee</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Overview of the TAC 2009 knowledge base population track</article-title>
          .
          <source>In: Proceedings of the 2009 Text Analysis Conference</source>
          . (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Syed</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Creating and Exploiting a Web of Semantic Data</article-title>
          .
          <source>In: Proc. 2nd International Conference on Agents and Artificial Intelligence</source>
          , Springer (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cafarella</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Webtables: exploring the power of tables on the web</article-title>
          .
          <source>PVLDB</source>
          <volume>1</volume>
          (
          <year>2008</year>
          )
          <fpage>538</fpage>
          –
          <lpage>549</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Barrasa</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corcho</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez-Perez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>R2O, an extensible and semantically based database-to-ontology mapping language</article-title>
          .
          <source>In: Proc. 2nd Workshop on Semantic Web and Databases (SWDB 2004)</source>
          . Volume
          <volume>3372</volume>
          . (
          <year>2004</year>
          )
          <fpage>1069</fpage>
          –
          <lpage>1070</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Discovering simple mappings between relational database schemas and ontologies</article-title>
          . In:
          <string-name><surname>Aberer</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Choi</surname>, <given-names>K.S.</given-names></string-name>,
          <string-name><surname>Noy</surname>, <given-names>N.F.</given-names></string-name>,
          <string-name><surname>Allemang</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Lee</surname>, <given-names>K.I.</given-names></string-name>,
          <string-name><surname>Nixon</surname>, <given-names>L.J.B.</given-names></string-name>,
          <string-name><surname>Golbeck</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Mika</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Maynard</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Mizoguchi</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Schreiber</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Cudre-Mauroux</surname>, <given-names>P.</given-names></string-name>
          , eds.:
          <source>ISWC/ASWC</source>
          . Volume
          <volume>4825</volume>
          of Lecture Notes in Computer Science., Springer (
          <year>2007</year>
          )
          <fpage>225</fpage>
          –
          <lpage>238</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Papapanagiotou</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katsiouli</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsetsos</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anagnostopoulos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hadjiefthymiades</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>RONTO: Relational to ontology schema matching</article-title>
          .
          <source>In: AIS SIGSEMIS BULLETIN</source>
          . (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Lawrence</surname>
            ,
            <given-names>E.D.</given-names>
          </string-name>
          :
          <article-title>Composing mappings between schemas using a reference ontology</article-title>
          .
          <source>In: Proceedings of International Conference on Ontologies, Databases and Application of Semantics (ODBASE)</source>
          , Springer (
          <year>2004</year>
          )
          <fpage>783</fpage>
          –
          <lpage>800</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Sahoo</surname>
            ,
            <given-names>S.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halb</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Idehen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thibodeau</surname>
            Jr,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sequeda</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ezzat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A survey of current approaches for mapping of relational databases to RDF</article-title>
          .
          <source>Technical report, W3C</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feigenbaum</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranker</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fogarolli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sequeda</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Use cases and requirements for mapping relational databases to RDF, W3C working draft</article-title>
          .
          <source>Technical report</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parr</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sachs</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>RDF123: from Spreadsheets to RDF</article-title>
          .
          <source>In: Seventh International Semantic Web Conference</source>
          , Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Langegger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wöß</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>XLWrap - querying and integrating arbitrary spreadsheets with SPARQL</article-title>
          .
          <source>In: 8th International Semantic Web Conference (ISWC 2009)</source>
          . (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yesha</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Finding Semantic Web Ontology Terms from Words</article-title>
          .
          <source>In: Proceedings of the Eighth International Semantic Web Conference</source>
          , Springer (
          <year>2009</year>
          )
          (poster paper).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Limaye</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarawagi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chakrabarti</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Annotating and searching web tables using entities, types and relationships</article-title>
          .
          <source>In: Proc. of the 36th Int'l Conference on Very Large Databases (VLDB)</source>
          . (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobilarov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Becker</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>DBpedia - a crystallization point for the web of data</article-title>
          .
          <source>Journal of Web Semantics</source>
          <volume>7</volume>
          (
          <year>2009</year>
          )
          <fpage>154</fpage>
          –
          <lpage>165</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Syed</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mulwad</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Exploiting a Web of Semantic Data for Interpreting Tables</article-title>
          .
          <source>In: Proc. Second Web Science Conference</source>
          . (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Bollacker</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evans</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paritosh</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sturge</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taylor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Freebase: a collaboratively created graph database for structuring human knowledge</article-title>
          .
          <source>In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data. SIGMOD '08</source>
          , New York, NY, USA, ACM (
          <year>2008</year>
          )
          <fpage>1247</fpage>
          –
          <lpage>1250</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>G.A.</given-names>
          </string-name>
          :
          <article-title>WordNet: a lexical database for English</article-title>
          .
          <source>Commun. ACM</source>
          <volume>38</volume>
          (
          <year>1995</year>
          )
          <fpage>39</fpage>
          –
          <lpage>41</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Suchanek</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kasneci</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Yago: A Core of Semantic Knowledge</article-title>
          .
          <source>In: 16th Int. World Wide Web Conf</source>
          ., New York, ACM Press (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Hatcher</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gospodnetic</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Lucene in Action (In Action series)</article-title>
          .
          <source>Manning Publications</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Levenshtein</surname>
            ,
            <given-names>V.I.</given-names>
          </string-name>
          :
          <article-title>Binary codes capable of correcting deletions, insertions, and reversals</article-title>
          .
          <source>Technical Report 8</source>
          (
          <year>1966</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mcgill</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          :
          <article-title>Introduction to Modern Information Retrieval</article-title>
          . McGraw-Hill, Inc., New York, NY, USA (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Training linear SVMs in linear time</article-title>
          .
          <source>In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. KDD '06</source>
          , New York, NY, USA, ACM (
          <year>2006</year>
          )
          <fpage>217</fpage>
          –
          <lpage>226</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , Schutze, H.: Introduction to Information Retrieval.
          <volume>1</volume>
          <fpage>edn</fpage>
          . Cambridge University Press (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>