=Paper=
{{Paper
|id=None
|storemode=property
|title=Generating Linked Data by Inferring the Semantics of Tables
|pdfUrl=https://ceur-ws.org/Vol-880/VLDS-p17-Mulwad.pdf
|volume=Vol-880
|dblpUrl=https://dblp.org/rec/conf/vlds/MulwadFJ11
}}
==Generating Linked Data by Inferring the Semantics of Tables==
Generating Linked Data by Inferring
the Semantics of Tables ∗
Varish Mulwad, Tim Finin and Anupam Joshi
Computer Science and Electrical Engineering
University of Maryland, Baltimore County
Baltimore, Maryland 21250 USA
{varish1, finin, joshi}@cs.umbc.edu
ABSTRACT

Vast amounts of information are encoded in structured tables found in documents, on the Web, and in spreadsheets or databases. Integrating or searching over this information benefits from understanding its intended meaning. Evidence for a table’s meaning can be found in its column headers, cell values, implicit relations between columns, caption and surrounding text, but interpreting it also requires general and domain-specific background knowledge. We represent a table’s meaning by mapping columns to classes in an appropriate ontology, linking cell values to literal constants, implied measurements, or entities in the linked data cloud (existing or new), and discovering and identifying relations between columns. We describe techniques grounded in graphical models and probabilistic reasoning to infer the meaning (semantics) associated with a table. Using background knowledge from the Linked Open Data cloud, we jointly infer the semantics of column headers, table cell values (e.g., strings and numbers) and relations between columns, and represent the inferred meaning as a graph of RDF triples. We motivate the value of this approach using tables from the medical domain, discussing some of the challenges presented by these tables and describing techniques to tackle them.

Keywords

Semantic Web, linked data, human language technology, entity linking, information retrieval

∗This research was supported in part by NSF awards 0326460 and 0910838, MURI award FA9550-08-1-0265 from AFOSR, and a gift from Microsoft Research.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. This article was presented at the workshop Very Large Data Search (VLDS) 2011. Copyright 2011

1. INTRODUCTION

Most of the information found on the Web consists of text written in a conventional style, e.g., as news stories, blogs, reports, letters, advertisements, etc. There is also a significant amount of information encoded in structured forms like tables and spreadsheets, including stand-alone spreadsheets or tables as well as tables embedded in Web pages or other documents. Cafarella et al. [4] estimated that the Web contains over 150 million high quality relational tables. In some ways, this information is easier to understand because of its structure, but in other ways it is more difficult because it lacks the normal organization and context of narrative text. Both integrating and searching over this information will benefit from a better understanding of its intended meaning.

A wide variety of domains that are interesting both technically and from a business perspective have tabular data. These include medicine, healthcare, finance, e-science (e.g., biotechnology), and public policy. Key information in the literature of these domains, which can be very useful for informing public policy, is often encoded in tables. As part of Open Data and transparency initiatives, fourteen nations including the United States of America share data and information on websites like www.data.gov in structured formats such as CSV and XML. As of May 2011, there are nearly 390,000 raw datasets available. This represents a large source of knowledge, yet we do not have systems that can understand and exploit this knowledge.

Many real world problems and applications can benefit from exploiting information stored in tables, including evidence based medical research [11]. Its goal is to judge the efficacy of drug dosages and treatments by performing meta-analyses (i.e., systematic reviews) over published literature and clinical trials. The process involves finding appropriate studies, extracting useful data from them and performing statistical analysis over the data to produce an evidence report. Key information required to produce evidence reports includes data such as patient demographics, drug dosage information, different types of drugs used, brands of the drugs used, number of patients cured with a particular dosage, etc. Most of this information is encoded in tables, which are currently beyond the scope of regular text processing systems and search engines. This makes the process manual and cumbersome for medical researchers.

Presently medical researchers perform keyword based searches on systems such as PubMed’s MEDLINE, which end up producing many irrelevant studies, requiring researchers to manually evaluate all of the studies to select the relevant ones. Figure 1, obtained from [5], shows the large gap between the number of meta-analyses and the number of clinical trials published every year. By adding semantics to such tables, we can develop systems that can easily correlate, integrate and search over different tables from different studies to be combined for a single meta-analysis.

Figure 1: The number of papers reporting on systematic reviews and meta-analyses is small compared to those reporting on individual clinical trials, as shown in this data from MEDLINE.

In this paper, we present techniques to infer the intended meaning of tables by jointly inferring the semantics of column headers, table cell values (e.g., strings and numbers), and relations between columns, augmented with background knowledge from open data sources such as the Linked Open Data cloud [1]. Our framework maps columns to classes from an appropriate ontology, links cell values to literal constants or entities in the linked data cloud (existing or new) and discovers and identifies relations between columns. The interpreted meaning is represented as machine understandable linked RDF assertions.

2. RELATED WORK

Early work in table understanding focused on extracting tables from documents and web pages [7, 6]. While progress has been made in identifying the structure of the table, relatively little work has focused on understanding the semantics and meaning associated with tables. Recently, three groups have focused on understanding the meaning associated with tables.

Wang et al. [16] use an approach that begins by identifying a single ‘entity column’ in a table and, based on its values and the rest of the column headers, associates a concept with the table. Their work focuses only on identifying the concept to be associated with the table (i.e., with the “entity column”). The concepts come from the Probase [17] knowledge base created from the text on the World Wide Web. Such concepts may not be semantically rich as compared to concepts from DBpedia or the Linked Open Data cloud. Their work does not attempt to link the table cell values or identify relations between columns.

Venetis et al. [15] use a framework that associates multiple class labels (or concepts) with columns in a table. They identify relations between the ‘subject’ column and the rest of the columns in the table. Both the concept identification for the columns and the relation identification are based on a maximum likelihood hypothesis, i.e., the best class label (or relation) is the one that maximizes the probability of the values given the class label (or relation) for the column. They also rely on an isA database they create from the text on the Web, which may not be semantically rich. Their work also does not attempt to link the table cell values.

Limaye et al. [9] describe a system based on a graphical model which maps every column header to a class from a known ontology, links table cell values to entities from a knowledge-base and identifies relations between columns. They rely on Yago [12] for background knowledge.

Current systems for interpreting tables rely on semantically poor and possibly noisy knowledge-bases. Nor do they focus on a “complete interpretation” of a table. None of the current systems propose or generate any form of linked data from the inferred meaning. A key missing component in current systems is tackling literal constants. The work mentioned above will work well with string based tables. To the best of our knowledge, no work has tackled the problem of interpreting literals in tables and using them as evidence in the table interpretation framework. Interpreting tables from specialized domains such as medical research will require incorporating modules that can understand literals.

Several systems have been implemented to generate Semantic Web data from databases and spreadsheets. Virtually all are manual or semi-automated and none has focused on automatically generating linked RDF data. None of the systems or methods proposed above focus on a truly complete automated interpretation of a table. Current systems on the Semantic Web either require users to specify the mapping to translate relational data to RDF or, when they do it automatically, focus on only a part of the table (like column header strings). These systems have mainly focused on relational databases or simple spreadsheets. The key shortcoming in such systems is that they rely heavily on users and their knowledge of the Semantic Web. Most systems on the Semantic Web also do not automatically link classes and entities generated from their mapping to existing resources on the Semantic Web. The output of such systems turns out to be just “raw string data” represented as RDF, instead of high quality linked RDF.

The framework we present is a complete automated interpretation of a table that focuses on all aspects of a table: column headers, row values, and relations between columns. Our framework will not only tackle strings but also handle literals and work across multiple domains: web tables, medical data and open government data.

3. INTERPRETING A TABLE

One might be tempted to think that regular text processing might work with tables as well. After all, tables also store text. However, that is not the case. It is said that tables store information in a “structured form”. It is this very structure used to represent the data that hinders systems from understanding the intended meaning of a table. To differentiate between text processing and table processing, consider the text “Barack Hussein Obama II (born August 4, 1961) is the 44th and current President of the United States. He is the first African American to hold the office.” The overall meaning can be understood from the meaning of words in the sentence. The meaning of each word can be recovered from the word itself or by using the context of the surrounding words.

Now consider the table in Figure 2, which has data on the eradication rates for different treatment regimens for a disease, in this case H. pylori. The abbreviations in the row headers of the table represent the different treatment regimens and the abbreviations in the column headers represent the different types of analyses used in the clinical trial. The data values in the table indicate the number of patients cured for a particular regimen and under a particular analysis. There is often additional information encoded in the table which is not directly evident; for example, in this table the drugs used in the treatment are some combination of a Proton pump inhibitor drug and antibiotics.

Figure 2: Tables in clinical trials literature have characteristics that differ from typical, generic Web tables. They often have row headers as well as column headers, most of the cell values are numeric, cell values are often structured and captions can contain detailed metadata. (From [18])

It is clear from this example that the true meaning associated with a table is often encoded in its structure, the column (and row) headers of the table, the relations implicit between the various columns and the values (string or literal) in the table. Evidence for what a table means may also come from the caption associated with it as well as the free text surrounding the table.

How does one interpret what the column (or row) headers and data values intend to convey? Expanding the abbreviations in the row headers will produce strings that map to existing entities from a knowledge base. For example, OA will map to dbpedia:Omeprazole and dbpedia:Amoxicillin. A combination of drugs in the given string indicates that the string is a type of dosage or treatment regimen. Once all the row headers are disambiguated, using information from the Linked Open Data cloud, we can infer additional information encoded in the table, namely that all the drugs are a combination of a Proton pump inhibitor and antibiotics.

The numbers in the first column of the table in Figure 2 and the way they are represented indicate that they are some form of count or total. Using this evidence along with the evidence provided by the row headers and the expanded form of the abbreviation ITT, it can be inferred that the column header maps to dbpedia:Intention to treat analysis, which is a type of yago:ClinicalTrials. Once it is known that the row headers represent dosages and column one represents a type of analysis in a clinical trial, it can be inferred that the data values in column one represent the eradication rate for some disease for a given dosage. In our example, the disease would be known from the caption of the table.

Consider the table shown in Figure 3. The column headers suggest the type of information in the columns: city and state might match classes in a target ontology such as DBpedia [2]; mayor and population could match properties in the same or related ontologies. Examining the data values, which are initially just strings, provides additional information that can confirm some possibilities and disambiguate between possibilities for others. For example, the strings in column one can be recognized as entity mentions that are instances of the dbpedia-owl:Place class. Additional analysis can automatically generate a narrower description such as major cities located in the United States (yago:IndependentCitiesInTheUnitedStates).

City          State  Mayor               Population
Baltimore     MD     S.C.Rawlings-Blake  640,000
Philadelphia  PA     M.Nutter            1,500,000
New York      NY     M.Bloomberg         8,400,000
Boston        MA     T.Menino            610,000

@prefix rdfs: .
@prefix dbpedia: .
@prefix dbpedia-owl: .
@prefix dbpprop: .
“City”@en is rdfs:label of dbpedia-owl:City.
“State”@en is rdfs:label of dbpedia-owl:AdministrativeRegion.
“Baltimore”@en is rdfs:label of dbpedia:Baltimore.
dbpedia:Baltimore a dbpedia-owl:City.
“MD”@en is rdfs:label of dbpedia:Maryland.
dbpedia:Maryland a dbpedia-owl:AdministrativeRegion.

Figure 3: This example shows a simple table about cities in the United States and some output of the prototype system that represents the extracted information as linked data annotated with additional metadata. We use the N3 serialization of RDF for readability.

Consider the strings in column three. The strings by themselves suggest that they are politicians. The column header provides additional evidence and a better interpretation: the strings in column three are actually mayors. Discovering relations between columns is important as well. By identifying the relation between column one and column three, we can infer that the strings in column three are mayors of the cities presented in column one. Linking the table cell values to known entities enriches the table further. By linking S.C.Rawlings-Blake to dbpedia:Stephanie C. Rawlings-Blake, T.Menino to dbpedia:Thomas Menino, and M.Nutter to dbpedia:Michael Nutter, we can automatically infer the additional information that all three belong to the Democratic party, since this information will be associated with the linked entities.

Column four in this table presents literal values. The numbers in the column are values of the property dbpedia-owl:populationTotal and this property can be associated with the cities in column one. All the values in the column are in the hundreds of thousands or millions. They provide evidence that the column may be representing the property population. Once the relation between column one and column four is discovered, we can also look up DBpedia, where the linked cities in column one will further confirm that the numbers represent the population of the respective cities.

Producing an overall interpretation of a table is a complex task that requires developing an overall understanding of the intended meaning of the table as well as attention to the details of choosing the right URIs to represent both the schema and the instances. We break down the process into the following tasks: a) assign every column (or row header) a class label from an appropriate ontology; b) link table cell values to appropriate LD entities, if possible; c) discover relationships between the table columns and link them to linked data properties; d) generate a linked data representation of the inferred data.

4. APPROACH

In the following sections, we first present a baseline system that we developed to evaluate the feasibility of tackling the table interpretation problem. Later we present techniques for building a framework that overcomes the shortcomings of the baseline system and is grounded in the principles of graphical models and probabilistic reasoning. Finally we discuss challenges posed by tables in medical literature and present some techniques for dealing with them.

4.1 The Baseline System

The baseline system is a sequential, multi-step framework that first maps every column header to a class from an appropriate ontology. Using the predicted class as additional evidence, it then links table cell values to entities from the Linked Data Cloud. The final step in the framework is discovering relations between table columns and generating a linked data representation of the table’s meaning.

Mapping column header to class. In a typical well formed table, each column contains data of a single syntactic type (e.g., strings) that represent entities or values of a common semantic type (e.g., people, places, yearly salary in USD). The column’s header, if present, may name or describe the semantic type or perhaps a relation in which the column values participate. The algorithm determines the class for a table column based on the class of the individual strings in the column. For all the cell values in every column of the table, the algorithm submits a complex query to the Wikitology [13] knowledge base to determine the type of each cell value in the column. For every query, the KB returns a set of entities; each entity has a set of classes associated with it. Combining the classes of all the entities produces a set of candidate classes for a column. Each class label from the set of candidate class labels is scored. The class label with the highest score is chosen as the class label to be associated with the column. We predict class labels from four vocabularies: DBpedia Ontology, Freebase, WordNet, and Yago.

Linking table cells to entities. Using the predicted class labels as additional evidence, the algorithm for linking table cells to entities re-queries our KB for every table cell. For every table cell, the KB returns the top N possible entities. For each of the top N entities, the algorithm generates a feature vector consisting of the entity’s KB score, the entity’s Wikipedia page length, the entity’s page rank, the Levenshtein distance between the entity and the string in the query and the Dice score between the entity and the string. The set of feature vectors for each table cell is ranked using an SVM-Rank classifier. To the highest ranked feature vector from SVM-Rank, two more features are added: the SVM-Rank score of the feature vector and the difference in SVM-Rank scores between the top two feature vectors. A second SVM classifier decides whether to link the table cell to this top ranked entity or not. If the evidence is not strong enough, it is likely that the table cell is a new entity not present in the KB; this step is useful in the discovery of new entities in a given table. If the evidence is strong enough, the table cell is linked to the top ranked entity returned by SVM-Rank.

Discovering relation between columns. Once the table cells are linked, the framework identifies relations between table columns. For every pair of columns, the algorithm generates a set of candidate relations from the relations that exist between the strings in each row of the two columns by querying DBpedia. The relation that gets the majority vote is chosen as the relation between the columns.

Linked data representation. We have developed a template for annotating and representing tables as linked RDF. We choose the N3 serialization because it is compact and readable. The second part of Figure 3 shows an example of an N3 representation of a table. To associate the column header with its predicted class label, the rdfs:label property from RDF Schema [3] is used. The rdfs:label property is also used to associate the table cell string with its associated entity from DBpedia. To associate the table string with its type (i.e., the class label of the column header), the rdf:type property is used.

Evaluation of the baseline system. The baseline system was evaluated against 15 tables obtained from Google Squared, Wikipedia and a collection of tables extracted from the Web. Excluding the columns with numbers, the 15 tables have 52 columns and 611 entities for evaluation of our algorithms. We used a subset of 23 columns for evaluation of relation identification between columns.

In the first evaluation of the algorithm for assigning class labels to columns, we compared the ranked list of possible class labels generated by the system against the list of possible class labels ranked by the evaluators. For 80.76% of the columns the average precision between the system’s and the evaluators’ lists was greater than 0, which indicates that there was at least one relevant label in the top three of the system ranked list. The mean average precision for the 52 columns was 0.411. For 75% of the columns, the recall of the algorithm was greater than or equal to 0.6. We also assessed whether our predicted class labels were reasonable based on the judgement of human subjects. 76.92% of the predicted class labels were considered correct by the evaluators. The accuracy in each of the four categories is shown in Figure 4.
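
The average-precision criterion used in the evaluation of column class labels (Section 4.1) can be illustrated with a small sketch. The paper does not give the exact formula, so this is an assumption: standard average precision truncated at the top three system labels and normalised by the smaller of the cutoff and the number of relevant labels. The function names and the judgments in `columns` are hypothetical and only serve as an illustration.

```python
def average_precision(ranked_labels, relevant_labels, k=3):
    """Average precision of the top-k system labels against the relevant labels."""
    relevant = set(relevant_labels)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Normalisation is an assumption of this sketch, not taken from the paper.
    return precision_sum / min(k, len(relevant))


def mean_average_precision(columns, k=3):
    """Mean of the per-column average precision values."""
    scores = [average_precision(ranked, relevant, k) for ranked, relevant in columns]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical (system ranking, evaluator-approved labels) pairs for three columns.
columns = [
    (["dbpedia-owl:City", "dbpedia-owl:Place", "dbpedia-owl:AdministrativeRegion"],
     {"dbpedia-owl:City", "dbpedia-owl:Place"}),
    (["dbpedia-owl:Place", "dbpedia-owl:City", "dbpedia-owl:AdministrativeRegion"],
     {"dbpedia-owl:AdministrativeRegion"}),
    (["dbpedia-owl:Place", "dbpedia-owl:City", "dbpedia-owl:AdministrativeRegion"],
     {"yago:IndependentCitiesInTheUnitedStates"}),
]
per_column = [average_precision(ranked, relevant) for ranked, relevant in columns]
share_nonzero = sum(score > 0 for score in per_column) / len(per_column)
```

Under these assumptions, a column contributes to the "average precision greater than 0" share exactly when at least one relevant label appears among its top three system labels, which matches the reading given in the text.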

Figure 4: Category-wise accuracy for (a) “column correctness” and (b) entity linking.

Figure 5: Parametrized Markov network. The square nodes are the factor nodes in the graph.

Our algorithm for linking table cells correctly linked 66.12% of the table cell strings. The breakdown of accuracy by category is shown in Figure 4. Our dataset had 24 new entities, and our algorithm correctly predicted all 24 as new entities not present in the KB. The results for relationship identification were not encouraging, with an accuracy of 25%.
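
The majority-vote step that the baseline system uses to pick a relation between two columns (Section 4.1) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `lookup_relations` is a hypothetical stand-in for the DBpedia query, the toy knowledge base and its property names are placeholders, and the handling of ties and of rows without any relation is an assumption, since the paper does not specify it.

```python
from collections import Counter


def choose_relation(rows, lookup_relations):
    """Pick the relation that most rows of a column pair vote for.

    rows: iterable of (cell_from_first_column, cell_from_second_column).
    lookup_relations: callable returning the candidate relations for one row
    (hypothetical stand-in for a knowledge-base query).
    """
    votes = Counter()
    for subject, obj in rows:
        # Each row votes at most once for every relation it supports.
        votes.update(set(lookup_relations(subject, obj)))
    if not votes:
        return None  # no candidate relation found for any row
    # Ties are resolved by first occurrence; this is an assumption.
    relation, _count = votes.most_common(1)[0]
    return relation


# Toy knowledge base with placeholder property names (illustrative only).
TOY_KB = {
    ("Baltimore", "S.C.Rawlings-Blake"): ["dbpedia-owl:leaderName"],
    ("Philadelphia", "M.Nutter"): ["dbpedia-owl:leaderName", "dbpprop:residence"],
    ("New York", "M.Bloomberg"): ["dbpedia-owl:leaderName"],
    ("Boston", "T.Menino"): [],
}


def toy_lookup(subject, obj):
    return TOY_KB.get((subject, obj), [])


rows = list(TOY_KB.keys())
best = choose_relation(rows, toy_lookup)
```

Because every row votes independently, a few rows with missing or wrong links can change the outcome, which is one way errors from the entity-linking step carry over into relation identification.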

4.2 Joint Inference over a table

The baseline system makes a local decision at each step of the framework. The disadvantage of such a system is that errors percolate from one phase to the next, which can lead to an overall poor interpretation of a table. To overcome this problem, we are developing a framework that performs joint inference over the evidence available in the table and jointly assigns values to the column headers, table cell values and relations between columns.

Probabilistic graphical models [8] provide a convenient framework for expressing a joint probability over a set of variables in a system and performing inference over them. Constructing a graphical model involves the following steps: a) identifying the variables in the system; b) identifying the interactions between variables and representing them as a graph; c) parametrizing the graphical structure; d) selecting an appropriate algorithm for inference. In this paper, we present the first three steps in constructing a graphical model for interpreting tables.

Variables in the system. The column headers, the table cell values and the relations between columns in the table represent the set of variables in the table interpretation framework.

Graphical Representation. We choose a Markov network based graphical representation, since the interactions between the column headers, table cell values and relations between table columns are symmetrical. The interaction between a column header and the cell values in the column is captured by inserting an edge between the column header and each of the values in the column in the graph. To correctly disambiguate what a table cell value is, evidence from the rest of the values in the same row can be used. This is captured by inserting edges between every pair of cell values in a given row. A similar interaction exists between the column headers and is captured by the edges between every pair of table column headers.

A parametrized Markov network. To represent the distribution associated with the graph structure, we need to parametrize the structure. One way to parametrize a Markov network is to represent the graph as a factor graph. A factor graph is an undirected graph containing two types of nodes: variable nodes and factor nodes. The graph has edges only between factor nodes and variable nodes. A factor node captures and computes the affinity between the variables interacting at that factor node. Variable nodes can also have associated “node potentials”. Our parametrized graph (Figure 5) consists of two node potentials (associated with each of the column headers and table cell values) and three factor nodes.

The node potential for a column header variable computes the affinity between the string in the column header and the class it is being mapped to. The node potential for a table cell value computes the affinity between the string in the table cell and the entity it is being linked to. The functions of the three factor nodes are as follows: the first factor node computes the affinity between the class being assigned to the column header and the entities linked to the cell values in the column; the second factor node computes the affinity between the classes that have been assigned to all the column headers; and the third factor node computes the affinity between the entities linked to the cell values in a given row. We are presently working on defining the functions in the factor nodes that will compute the affinity between the values assigned to the various variables in the system.

4.3 Challenges

Results of our baseline system demonstrated the feasibility of interpreting tables as proposed above. In the following section we present techniques for dealing with challenges posed by tables in the medical literature and describe how such tables can be adapted to be processed using our existing techniques.

Abbreviations. Tables from the medical literature tend to use abbreviations heavily, primarily to represent dosages, types of analyses used in the clinical trials, types of tests conducted and so on. As in the table in Figure 2, the meaning of the abbreviations is often encoded in the table caption. A pre-processing step would involve processing the table caption to generate abbreviations and their expansions and then replacing the abbreviations in the table.

Literals. Literals pose a unique challenge, especially for tables from medical literature. We demonstrated how strings in a column can be used as evidence in a table interpretation framework. But what about literals like numerical data values in a table? To begin with, the range of numbers in a given column can start providing evidence about what the column is. For example, if the numbers are in the range of 100s, then the column could be percentages or ages. The row (or column) header may have additional clues. For example, in the case of percentages, the % sign may be associated with the numbers in the table cell or it may be present in the row (or column) header in the table.

This brings us to the next thing that needs to be extracted from such tables: the units associated with numbers. The units associated with numerical data are encoded either in the row (or column) header of the table or in the caption of the table. An important step will be identifying the individual units to be associated with numerical data in the table.

Finally, numerical data is often represented in pairs. Formats like number/count, number(%), % (number), number,unit are some examples of how numerical data is encountered in tables in medical literature. The meaning of these formats is again given in the table caption or in the table header.

Table Interpretation. A useful interpretation of tables used in meta-analysis would identify and link the drugs used in the treatment, identify the type of analyses performed and the success rate, identify and link the disease(s) under consideration, capture adverse events in the treatments, if any, and generate a linked data representation of all of this. Once the “pre-processing steps” mentioned above are complete, some of our existing techniques can be adapted to link the row and column headers to either classes or entities from a knowledge base and then to generate the requisite linked data interpretation of the table.

5. CONCLUSION

Generating an explicit representation of the meaning implicit in tabular data will support automatic integration and more accurate search. Clues for a table’s intended meaning are present in column and row headers, cell values, implicit relations between columns, and any descriptive text. We described general techniques grounded in graphical models and probabilistic reasoning to infer a table’s meaning relative to a knowledge base of general and domain-specific knowledge expressed in the Semantic Web language OWL. We represent a table’s meaning as a graph of OWL triples where the columns have been mapped to classes, cell values to literals, measurements, or knowledge-base entities and relations to triples. One practical use case we are studying is representing the meaning of tables found in papers from medical journals. We discussed some of the challenges presented by these tables and described techniques to tackle them.

6. REFERENCES

[1] C. Bizer. The emerging web of linked data. IEEE Intelligent Systems, 24(5):87–92, 2009.
[2] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
[3] D. Brickley and R. Guha. RDF Vocabulary Description Language 1.0: RDF Schema. W3C Recommendation, World Wide Web Consortium, February 2004.
[4] M. J. Cafarella, A. Y. Halevy, Z. D. Wang, E. Wu, and Y. Zhang. Webtables: exploring the power of tables on the web. PVLDB, 1(1):538–549, 2008.
[5] A. Cohen, C. Adams, J. Davis, C. Yu, P. Yu, W. Meng, L. Duggan, M. McDonagh, and N. Smalheiser. Evidence-based medicine, the essential role of systematic reviews, and the need for automated text mining tools. In Proc. 1st ACM Int. Health Informatics Symposium, pages 376–380. ACM, 2010.
[6] D. W. Embley, D. P. Lopresti, and G. Nagy. Notes on contemporary table recognition. In Document Analysis Systems, pages 164–175, 2006.
[7] M. Hurst. Towards a theory of tables. IJDAR, 8(2-3):123–131, 2006.
[8] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[9] G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and searching web tables using entities, types and relationships. In Proc. 36th Int’l Conference on Very Large Databases, 2010.
[10] V. Mulwad, T. Finin, Z. Syed, and A. Joshi. Using linked data to interpret tables. In Proc. 1st Int. Workshop on Consuming Linked Data, Nov. 2010.
[11] D. Sackett, W. Rosenberg, J. Gray, R. Haynes, and W. Richardson. Evidence based medicine: what it is and what it isn’t. BMJ, 312(7023):71, 1996.
[12] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: A Core of Semantic Knowledge. In 16th Int. World Wide Web Conf., New York, 2007. ACM Press.
[13] Z. Syed and T. Finin. Creating and Exploiting a Hybrid Knowledge Base for Linked Data. Revised Selected Papers Series: Communications in Computer and Information Science. Springer, April 2011.
[14] Z. Syed, T. Finin, V. Mulwad, and A. Joshi. Exploiting a Web of Semantic Data for Interpreting Tables. In Proc. 2nd Web Science Conf., April 2010.
[15] P. Venetis, A. Halevy, J. Madhavan, M. Pasca, W. Shen, F. Wu, G. Miao, and C. Wu. Recovering semantics of tables on the web. In Proc. 37th Int. Conf. on Very Large Databases, 2011.
[16] J. Wang, B. Shao, H. Wang, and K. Q. Zhu. Understanding tables on the web. Technical report, Microsoft Research Asia, 2011.
[17] W. Wu, H. Li, H. Wang, and K. Zhu. Towards a probabilistic taxonomy of many concepts. Technical report, Microsoft Research Asia, 2011.
[18] R. Zagari, G. Bianchi-Porro, R. Fiocca, G. Gasbarrini, E. Roda, and F. Bazzoli. Comparison of 1 and 2 weeks of omeprazole, amoxicillin and clarithromycin treatment for Helicobacter pylori eradication: the HYPER study. Gut, 56(4):475, 2007.