=Paper= {{Paper |id=None |storemode=property |title=Generating Linked Data by Inferring the Semantics of Tables |pdfUrl=https://ceur-ws.org/Vol-880/VLDS-p17-Mulwad.pdf |volume=Vol-880 |dblpUrl=https://dblp.org/rec/conf/vlds/MulwadFJ11 }} ==Generating Linked Data by Inferring the Semantics of Tables== https://ceur-ws.org/Vol-880/VLDS-p17-Mulwad.pdf
                                 Generating Linked Data by Inferring
                                     the Semantics of Tables ∗

                                             Varish Mulwad, Tim Finin and Anupam Joshi
                                                   Computer Science and Electrical Engineering
                                                     University of Maryland, Baltimore County
                                                         Baltimore, Maryland 21250 USA
                                                      {varish1, finin, joshi}@cs.umbc.edu

ABSTRACT                                                                             uments. Cafarella et al. [4] estimated that the Web contains
Vast amounts of information are encoded in structured ta-                            over 150 million high quality relational tables. In some ways,
bles found in documents, on the Web, and in spreadsheets                             this information is easier to understand because of its struc-
or databases. Integrating or searching over this information                         ture but in other ways it is more difficult because it lacks
benefits from understanding its intended meaning. Evidence                           the normal organization and context of narrative text. Both
for a table’s meaning can be found in its column headers, cell                       integrating or searching over this information will benefit
values, implicit relations between columns, caption and sur-                         from a better understanding of its intended meaning.
rounding text but also requires general and domain-specific
background knowledge. We represent a table’s meaning by                              A wide variety of domains that are interesting both tech-
mapping columns to classes in an appropriate ontology, link-                         nically and from a business perspective have tabular data.
ing cell values to literal constants, implied measurements, or                       These include medicine, healthcare, finance, e-science (e.g.,
entities in the linked data cloud (existing or new) and dis-                         biotechnology), and public policy. Key information in the
covering and identifying relations between columns. We                               literature of these domains, which can be very useful for in-
describe techniques grounded in graphical models and prob-                           forming public policy, is often encoded in tables. As a part
abilistic reasoning to infer meaning (semantics) associated                          of Open Data and transparency initiatives, fourteen nations
with a table. Using background knowledge from the Linked                             including the United States of America share data and infor-
Open Data cloud, we jointly infer the semantics of column                            mation on websites like www.data.gov in structured formats
headers, table cell values (e.g., strings and numbers) and re-                       like CSV and XML. As of May 2011, there are nearly 390,000
lations between columns and represent the inferred meaning                           raw datasets available. This represents a large source of
as a graph of RDF triples. We motivate the value of this                             knowledge, yet we do not have systems that can understand
approach using tables from the medical domain, discussing                            and exploit this knowledge.
some of the challenges presented by these tables and describ-
ing techniques to tackle them.                                                       Many real world problems and applications can benefit from
                                                                                     exploiting information stored in tables including evidence
Keywords                                                                             based medical research [11]. Its goal is to judge the efficacy
                                                                                     of drug dosages and treatments by performing meta-analyses
Semantic Web, linked data, human language technology, entity
                                                                                     (i.e., systematic reviews) over published literature and clini-
linking, information retrieval
                                                                                     cal trials. The process involves finding appropriate studies,
                                                                                     extracting useful data from them and performing statistical
1.     INTRODUCTION                                                                  analysis over the data to produce an evidence report. Key in-
Most of the information found on the Web consists of text                            formation required to produce evidence reports includes data
written in a conventional style, e.g. as news stories, blogs,                        such as patient demographics, drug dosage information, dif-
reports, letters, advertisements, etc. There is also a signifi-                      ferent types of drugs used, brands of the drugs used, number
cant amount of information encoded in structured forms like                          of patients cured with a particular dosage etc. Most of this
tables and spreadsheets, including stand-alone spreadsheets                          information is encoded in tables, which are currently beyond
or tables as well as tables embedded in Web pages or other doc-                      the scope of regular text processing systems and search en-
∗This research was supported in part by NSF awards 0326460 and                       gines. This makes the process manual and cumbersome for
0910838, MURI award FA9550-08-1-0265 from AFOSR, and a gift from
                                                                                     medical researchers.
Microsoft Research.
                                                                                     Presently medical researchers perform keyword based search
                                                                                     on systems such as PubMed’s MEDLINE which end up pro-
                                                                                     ducing many irrelevant studies, requiring researchers to man-
Permission to make digital or hard copies of all or part of this work for            ually evaluate all of the studies to select the relevant ones.
personal or classroom use is granted without fee provided that copies are            Figure 1 obtained from [5] clearly shows the huge difference
not made or distributed for profit or commercial advantage and that copies           in the numbers of meta-analyses and clinical trials pub-
bear this notice and the full citation on the first page. To copy otherwise, to      lished every year. By adding semantics to such tables, we
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. This article was presented at the workshop Very             can develop systems that can easily correlate, integrate and
Large Data Search (VLDS) 2011.                                                       search over different tables from different studies to be com-
Copyright 2011
                                                                  not attempt to link the table cell values.

                                                                  Limaye et al. [9] describe a system based on a graphical
                                                                  model which maps every column header to a class from
                                                                  a known ontology, links table cell values to entities from
                                                                  a knowledge-base and identifies relations between columns.
                                                                  They rely on Yago [12] for background knowledge.

                                                                  Current systems for interpreting tables rely on semantically
                                                                  poor and possibly noisy knowledge-bases. Nor do they
                                                                  focus on a “complete interpretation” of a table. None of
                                                                  the current systems propose or generate any form of linked
Figure 1: The number of papers reporting on sys-                  data from the inferred meaning. A key missing component
tematic reviews and meta-analyses is small com-                   in current systems is tackling literal constants. The work
pared to those reporting on individual clinical trials,           mentioned above will work well with string based tables. To
as shown in this data from MEDLINE.                               the best of our knowledge, no work has tackled the problem
                                                                  of interpreting literals in tables and using them as evidence
                                                                  in the table interpretation framework. Interpreting tables
bined for a single meta-analysis.                                 from specialized domains such as medical research will re-
                                                                  quire incorporating modules that can understand literals.
In this paper, we present techniques to infer the intended
meaning of tables by jointly inferring the semantics of col-      Several systems have been implemented to generate Seman-
umn headers, table cell values (e.g., strings and numbers),       tic Web data from databases and spreadsheets. Virtually
relations between columns, augmented with background kn-          all are manual or semi-automated and none has focused on
owledge from open data sources such as the Linked Open            automatically generating linked RDF data. None of the sys-
Data cloud [1]. Our framework maps columns to classes             tems or methods proposed above focus on a truly complete
from an appropriate ontology, links cell values to literal con-   automated interpretation of a table. Current systems on
stants or entities in the linked data cloud (existing or new)     the Semantic Web either require users to specify the map-
and discovers and identifies relations between columns.           ping to translate relational data to RDF or, when they
The interpreted meaning is represented as machine under-          do it automatically, focus on only a part of the table (like col-
standable linked RDF assertions.                                  umn header strings). These systems have mainly focused on
                                                                  relational databases or simple spreadsheets. The key short-
2.   RELATED WORK                                                 coming in such systems is that they rely heavily on users
                                                                  and their knowledge of the Semantic Web. Most systems on
Early work in table understanding focused on extracting ta-
                                                                  the Semantic Web also do not automatically link classes and
bles from documents and web pages [7, 6]. While progress
                                                                  entities generated from their mapping to existing resources
has been made in identifying the structure of the table, rel-
                                                                  on the Semantic Web. The output of such systems turns out
atively little work has been focused on understanding the
                                                                  to be just “raw string data” represented as RDF, instead of
semantics and meaning associated with tables. Recently,
                                                                  generating high quality linked RDF.
three groups have focused on understanding the meaning
associated with tables.
                                                                  The framework we present is a complete automated interpre-
                                                                  tation of a table that focuses on all aspects of a table - col-
Wang et al. [16] use an approach that begins by identifying
                                                                  umn headers, row values, relations between columns. Our
a single ‘entity column’ in a table and, based on its values
                                                                  framework will not only tackle strings but also handle liter-
and the rest of the column headers, associates a concept with the
                                                                  als and work across multiple domains - web tables, medical
table. Their work focuses only on identifying the concept to
                                                                  and open government data.
be associated with the table (i.e., with the “entity column”).
The concepts come from the Probase [17] knowledge base
created from the text on the World Wide Web. Such con-
cepts may not be semantically rich as compared to concepts        3.   INTERPRETING A TABLE
from DBpedia or the Linked Open Data cloud. Their work            One might be tempted to think that regular text processing
does not attempt to link the table cell values or identify        might work with tables as well. After all tables also store
relations between columns.                                        text. However that is not the case. It is said that tables store
                                                                  information in a “structured form”. It is this very structure
Venetis et al. [15] use a framework associating multiple class    used to represent the data that hinders systems from under-
labels (or concepts) with columns in a table. They identify       standing the intended meaning of a table. To differentiate
relations between the ‘subject’ column and the rest of the        between text processing and table processing consider the
columns in the table. Both the concept identification for         text “Barack Hussein Obama II (born August 4, 1961)
columns and relation identification are based on a maximum        is the 44th and current President of the United States. He
likelihood hypothesis, i.e., the best class label (or relation)   is the first African American to hold the office.” The over-
is one that maximizes the probability of the values given         all meaning can be understood from the meaning of words
the class label (or relation) for the column. They also rely      in the sentence. The meaning of each word can be
on an isA database they create from the text on the Web           recovered from the word itself or by using the context of the
which may not be semantically rich. Their work also does          surrounding words.
                                                                            City       State         Mayor           Population
                                                                         Baltimore      MD     S.C.Rawlings-Blake      640,000
                                                                        Philadelphia    PA          M.Nutter          1,500,000
                                                                         New York       NY        M.Bloomberg         8,400,000
                                                                          Boston        MA         T.Menino            610,000


                                                                      @prefix rdfs: .
                                                                      @prefix dbpedia: .
                                                                      @prefix dbpedia-owl: .
                                                                      @prefix dbpprop: .

                                                                      “City”@en is rdfs:label of dbpedia-owl:City.
                                                                      “State”@en is rdfs:label of dbpedia-owl:AdministrativeRegion.
                                                                      “Baltimore”@en is rdfs:label of dbpedia:Baltimore.
                                                                      dbpedia:Baltimore a dbpedia-owl:City.
                                                                      “MD”@en is rdfs:label of dbpedia:Maryland.
                                                                      dbpedia:Maryland a dbpedia-owl:AdministrativeRegion.

                                                                   Figure 3: This example shows a simple table about
                                                                   cities in the United States and some output of the
                                                                   prototype system that represents the extracted in-
                                                                   formation as linked data annotated with additional
                                                                   metadata. We use the N3 serialization of RDF for
Figure 2: Tables in clinical trials literature have                readability.
characteristics that differ from typical, generic Web
tables. They often have row headers as well as column
headers, most of the cell values are numeric, cell val-            evidence provided by the row headers and the expanded form
ues are often structured and captions can contain                  of the abbreviation ITT, it can be inferred that the column
detailed metadata. (From [18])                                     header maps to dbpedia:Intention to treat analysis which is
                                                                   a type of yago:ClinicalTrials. Once it is known that the
                                                                   row headers represent dosages and column one represents a
Now consider the table in Figure 2 which has data on the           type of analysis in a clinical trial, it can be inferred that
eradication rates for different treatment regimens for a dis-      the data values in column one represent eradication rate for
ease, in this case H-pylori. The abbreviations in the row          some disease for a given dosage. The disease in our
header of the table represent the different treatment regi-        example would be known from the caption of the table.
mens and the abbreviations in the column headers repre-
sent the different types of analyses used in the clinical trial.   Consider the table shown in Figure 3. The column head-
The data values in the table indicate the number of patients       ers suggest the type of information in the columns: city
cured for a particular regimen and under a particular anal-        and state might match classes in a target ontology such
ysis. There is often additional information encoded in the         as DBpedia [2]; mayor and population could match prop-
table which is not directly evident, for example in this table     erties in the same or related ontologies. Examining the
the drugs used in the treatment are some combination of a          data values, which are initially just strings, provides addi-
Proton pump inhibitor drug and antibiotics.                        tional information that can confirm some possibilities and
                                                                   disambiguate between possibilities for others. For exam-
It is clear from this example that the true meaning associated     ple, the strings in column one can be recognized as en-
with a table is often encoded in its structure, column (and        tity mentions that are instances of the dbpedia-owl:Place
row) headers of the table, the relations implicit between the      class. Additional analysis can automatically generate a nar-
various columns and the values (string or literal) in the table.   rower description such as major cities located in the United
Evidence for what a table means may also come from the             States (yago:IndependentCitiesInTheUnitedStates).
caption associated with it as well as the free text surrounding
the table.                                                         Consider the strings in column three. The strings by them-
                                                                   selves suggest that they are politicians. The column header
How does one interpret what the column (or row) headers,           provides additional evidence and better interpretation that
data values intend to convey? Expanding the abbreviations          the strings in column three are actually mayors. Discov-
in the row headers will produce strings that map to existing       ering relations between columns is important as well. By
entities from a knowledge base. For example, OA will map           identifying the relation between column one and column three,
to dbpedia:Omeprazole and dbpedia:Amoxicillin. A combi-            we can infer that the strings in column three are mayors
nation of drugs in the given string indicates that the string      of cities presented in column one. Linking the table cell
is a type of dosage or treatment regimen. Once all the             values to known entities enriches the table further. Link-
row headers are disambiguated, using information from the          ing S.C.Rawlings-Blake to dbpedia:Stephanie C. Rawlings-
Linked Open Data cloud, we can infer additional informa-           Blake, T.Menino to dbpedia:Thomas Menino, and M.Nutter to
tion encoded in the table: that all the drugs are combinations     dbpedia:Michael Nutter, we can automatically infer additional
of a Proton pump inhibitor and antibiotics.                        information that all three belong to the Democratic party,
                                                                   since the information will be associated with the linked en-
The numbers in the first column of the table in Figure 2           tities.
and the way they are represented indicate that they are some
form of a count/total. Using this evidence along with the          Column four in this table presents literal values. The num-
bers in the column are values of the property dbpedia-owl:po-      Linking table cells to entities. Using the predicted class
pulationTotal and this property can be associated with the         labels as additional evidence, for every table cell, the algo-
cities in column one. All the values in the column are in          rithm for linking table cells to entities re-queries our KB. For
the range of 100,000. They provide evidence that the col-          every table cell, the KB returns the top N possible entities.
umn may be representing the property population. Once the          For each of the top N entities, the algorithm generates a
relation between column one and column four is discovered,         feature vector consisting of the entity’s KB score, entity’s
we can also look up on DBpedia, where the linked cities in         Wikipedia page length, entity’s page rank, the Levenshtein
column one will further confirm that the numbers represent         distance between the entity and the string in the query and
population of the respective cities.                               the Dice score between the entity and the string. The set of
                                                                   feature vectors for each table cell is ranked using an SVM-
Producing an overall interpretation of a table is a complex        Rank classifier. To the highest ranked feature vector from
task that requires developing an overall understanding of          SVM-Rank, two more features are added: the SVM-Rank
the intended meaning of the table as well as attention to          score of the feature vector and the difference in SVM-Rank
the details of choosing the right URIs to represent both the       scores between the top two feature vectors. A second SVM
schema as well as instances. We break down the process             classifier decides whether to link the table cell to this top
into the following tasks: a) assign every column (or row header)   ranked entity or not. If the evidence is not strong enough,
a class label from an appropriate ontology b) link table cell      it is likely that the table cell is a new entity not present in
values to appropriate LD entities, if possible c) discover rela-   the KB; this step is useful in discovery of new entities in a
tionships between the table columns and link them to linked        given table. If the evidence is strong enough, the table cell
data properties d) generate a linked data representation of        is linked to the top ranked entity returned by SVM-Rank.
the inferred data.
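
The feature vector used by the baseline when linking a table cell to a candidate entity can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` record and its field names are hypothetical, the Dice score is assumed to be computed over character bigrams, and the KB lookup and SVM-Rank steps are left out.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical record for one entity returned by the KB (field names are assumptions).
    label: str
    kb_score: float
    page_length: int
    page_rank: float


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def bigrams(s: str) -> set:
    """Set of lower-cased character bigrams of s."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient over character bigram sets (an assumed variant)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def feature_vector(cell: str, cand: Candidate) -> list:
    """Features listed in the text, in order: KB score, page length, page rank,
    Levenshtein distance, Dice score."""
    return [float(cand.kb_score),
            float(cand.page_length),
            float(cand.page_rank),
            float(levenshtein(cell, cand.label)),
            dice(cell, cand.label)]
```

One such vector would be produced for each of the top N candidates of a cell and passed to the ranking step.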
                                                                   Discovering relation between columns. Once the table
                                                                   cells are linked, the framework identifies relations between
4.    APPROACH                                                     table columns. For every pair of columns, the algorithm gen-
In the following sections, we first present a baseline system      erates a set of candidate relations from the relations that
that we developed to evaluate the feasibility of tackling the      exist between the strings in each row of the two columns by
table interpretation problem. Later we present techniques          querying DBpedia. The relation that gets majority vote is
for building a framework which overcomes the shortcomings          chosen as the relation between the columns.
in the baseline system and a framework grounded in the
principles of graphical models and probabilistic reasoning.        Linked data representation. We have developed a tem-
Finally we discuss challenges posed by tables in medical lit-      plate for annotating and representing tables as linked RDF.
erature and present some techniques for dealing with them.         We choose the N3 serialization because it is compact and
                                                                   readable. The second part of Figure 3 shows an example
                                                                   of an N3 representation of a table. To associate the column
4.1    The Baseline System                                         header with its predicted class label, the rdfs:label property
                                                                   from RDF Schema [3] is used. The rdfs:label property is also
The baseline system is a sequential, multi-step framework
                                                                   used to associate the table cell string with its associated en-
that first maps every column header to a class from an ap-
                                                                   tity from DBpedia. To associate the table string with its
propriate ontology. Using the predicted class as additional
                                                                   type (i.e. class label of the column header), the rdf:type
evidence, it then links table cell values to entities from the
                                                                   property is used.
Linked Data Cloud. The final step in the framework is dis-
covering relations between table columns and generating a
                                                                   Evaluation of the baseline system. The baseline sys-
linked data representation of the table’s meaning.
                                                                   tem was evaluated against 15 tables obtained from Google
                                                                   Squared, Wikipedia and from a collection of tables extracted
Mapping column header to class. In a typical well
                                                                   from the Web. Excluding the columns with numbers, the 15
formed table, each column contains data of a single syn-
                                                                   tables have 52 columns and 611 entities for evaluation of our
tactic type (e.g., strings) that represent entities or values of
                                                                   algorithms. We used a subset of 23 columns for evaluation
a common semantic type (e.g., people, places, yearly salary
                                                                   of relation identification between columns.
in USD). The column’s header, if present, may name or de-
scribe the semantic type or perhaps a relation in which the
                                                                   In the first evaluation of the algorithm for assigning class
column values participate. The algorithm determines the
                                                                   labels to columns, we compared the ranked list of possible
class for a table column based on the class of the individual
                                                                   class labels generated by the system against the list of pos-
strings in the column. For all the cell values in every col-
                                                                   sible class labels ranked by the evaluators. For 80.76 % of
umn of the table, the algorithm submits a complex query to
                                                                   the columns the average precision between the system and
the Wikitology [13] knowledge base to determine the type
                                                                   evaluators list was greater than 0 which indicates that there
of each cell value in the column. For every query, the KB
                                                                   was at least one relevant label in the top three of the sys-
returns a set of entities; each entity has a set of classes as-
                                                                   tem ranked list. The mean average precision for 52 columns
sociated with it. Combining the classes of all the entities
                                                                   was 0.411. For 75 % of the columns, the recall of the algo-
produces a set of candidate classes for a column. Each class
                                                                   rithm was greater than or equal to 0.6. We also assessed
label from the set of candidate class labels is scored. The
                                                                   whether our predicted class labels were reasonable based on
class label with the highest score is chosen as the class label
                                                                   the judgement of human subjects. 76.92 % of the class labels
to be associated with the column. We predict class labels
                                                                   predicted were considered correct by the evaluators. The ac-
from four vocabularies: DBpedia Ontology, Freebase, Word-
                                                                   curacy in each of the four categories is shown in Figure 4.
Net, and Yago.
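
The column-to-class step can be sketched as below. The text does not specify the scoring function, so this sketch assumes the simplest possible score: each cell casts one vote for every class attached to any of its candidate entities. The input structure (a list of cells, each a list of entities, each a list of class labels) is also hypothetical and stands in for the results of the KB queries.

```python
from collections import Counter


def predict_column_class(cell_candidates):
    """Pick a class label for a column from the classes of its cells' candidate entities.

    cell_candidates: one entry per table cell; each entry is a list of candidate
    entities, and each entity is a list of class labels (assumed input shape).
    Returns (best_label, scores); best_label is None if there are no candidates.
    """
    scores = Counter()
    for entities in cell_candidates:
        classes_for_cell = set()
        for classes in entities:
            classes_for_cell.update(classes)
        # Assumed scoring: one vote per cell for every class it could belong to.
        scores.update(classes_for_cell)
    if not scores:
        return None, scores
    # Sort first so that ties are broken alphabetically and the result is deterministic.
    best = max(sorted(scores), key=scores.__getitem__)
    return best, scores
```

A weighted score (for example, one that takes the rank of each entity in the KB result into account) could replace the vote count without changing the surrounding structure.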
Figure 4: Category-wise accuracy for (a) “column correctness” and
(b) entity linking.


66.12 % of the table cell strings were correctly linked by our
algorithm for linking table cells. The breakdown of accuracy by
category is shown in Figure 4. Our dataset had 24 new entities, and
our algorithm correctly predicted all 24 as new entities not present
in the KB. We did not get encouraging results for relationship
identification, which had an accuracy of 25 %.

4.2   Joint Inference over a table
The baseline system makes a local decision at each step of the
framework. The disadvantage of such a system is that errors
percolate from one phase to the next, which can lead to an overall
poor interpretation of a table. To overcome this problem, we are
developing a framework that performs joint inference over the
evidence available in the table and jointly assigns values to the
column headers, table cell values and relations between columns.

Probabilistic graphical models [8] provide a convenient framework
for expressing a joint probability over a set of variables in a
system and performing inference over them. Constructing a graphical
model involves the following steps: a) identifying the variables in
the system; b) identifying the interactions between variables and
representing them as a graph; c) parametrizing the graphical
structure; and d) selecting an appropriate algorithm for inference.
In this paper, we present the first three steps in constructing a
graphical model for interpreting tables.

Variables in the system. The column headers, the table cell values
and the relations between columns in the table represent the set of
variables in the table interpretation framework.

Graphical Representation. We choose a Markov network based graphical
representation, since the interactions between the column headers,
table cell values and relations between table columns are
symmetrical. The interaction between a column header and the cell
values in its column is captured by inserting an edge between the
column header and each of the values in the column. To correctly
disambiguate a table cell value, evidence from the rest of the
values in the same row can be used. This is captured by inserting
edges between every pair of cell values in a given row. A similar
interaction exists between the column headers and is captured by
edges between every pair of table column headers.

Figure 5: Parametrized Markov network. The square nodes are the
factor nodes in the graph.

A parametrized Markov network. To represent the distribution
associated with the graph structure, we need to parametrize the
structure. One way to parametrize a Markov network is to represent
the graph as a factor graph. A factor graph is an undirected graph
containing two types of nodes: variable nodes and factor nodes. The
graph has edges only between factor nodes and variable nodes. A
factor node captures and computes the affinity between the variables
interacting at that factor node. Variable nodes can also have
associated “node potentials”. Our parametrized graph (Figure 5)
consists of two node potentials (associated with each of the column
headers and table cell values) and three factor nodes.

The node potential for a column header variable computes the
affinity between the string in the column header and the class it is
being mapped to. The node potential for a table cell value computes
the affinity between the string in the table cell and the entity it
is being linked to. The three factor nodes function as follows: the
first computes the affinity between the class being assigned to the
column header and the entities linked to the cell values in the
column; the second computes the affinity between the classes that
have been assigned to all the column headers; and the third computes
the affinity between the entities linked to the cell values in a
given row. We are presently working on defining the functions in the
factor nodes that will compute the affinity between the values
assigned to the various variables in the system.

4.3    Challenges
Results of our baseline system demonstrated the feasibility of
interpreting tables as proposed above. In the following section we
present techniques for dealing with challenges posed by tables in
the medical literature and how such tables can be adapted to be
processed using our existing techniques.

Abbreviations. Tables from the medical literature tend to use
abbreviations a lot, primarily to represent dosages, type of
analyses used in the clinical trials, types of tests conducted
and so on. As in Table 2, the meaning of the abbreviations is often
encoded in the table caption. A pre-processing step would involve
processing the table caption to generate abbreviations and their
expansions and then replacing the abbreviations in the table.

Literals. Literals pose a unique challenge, especially for tables
from medical literature. We demonstrated how strings in a column can
be used as evidence in a table interpretation framework. But what
about literals like numerical data values in a table? To begin with,
the range of numbers in a given column can start providing evidence
about what the column is. For example, if the numbers range from 0
to 100, then the column could be percentages or ages. The row (or
column) header may have additional clues. For example, in the case
of percentages, the % sign may be associated with the numbers in the
table cell or it may be present in the row (or column) header in the
table.

This brings us to the next thing that needs to be extracted from
such tables: the units associated with numbers. The units associated
with numerical data are encoded either in the row (or column) header
of the table or in the caption of the table. An important step will
be identifying the individual units to be associated with numerical
data in the table.

Finally, numerical data is often represented in pairs. Formats like
number/count, number(%), % (number), number,unit are some examples
of how numerical data is encountered in tables in medical
literature. The meaning of this format is again present in the table
caption or in the table header.

Table Interpretation. A useful interpretation of tables used in
meta-analysis would be identifying and linking the drugs used in the
treatment, identifying the type of analyses performed, success rate,
identifying and linking to the disease(s) under consideration,
adverse events in the treatments if any and generating a linked data
representation of it. Once the “pre-processing steps” mentioned
above are complete, some of our existing techniques can be adapted
to link the row and column headers to either classes or entities
from a knowledge base and then generate the requisite linked data
interpretation of the table.

5.   CONCLUSION
Generating an explicit representation of the meaning implicit in
tabular data will support automatic integration and more accurate
search. Clues for a table’s intended meaning are present in column
and row headers, cell values, implicit relations between columns,
and any descriptive text. We described general techniques grounded
in graphical models and probabilistic reasoning to infer a table’s
meaning relative to a knowledge base of general and domain-specific
knowledge expressed in the Semantic Web language OWL. We represent a
table’s meaning as a graph of OWL triples where the columns have
been mapped to classes, cell values to literals, measurements, or
knowledge-base entities and relations to triples. One practical use
case we are studying is representing the meaning of tables found in
papers from medical journals. We discussed some of the challenges
presented by these tables and described techniques to tackle them.

6.   REFERENCES
 [1] C. Bizer. The emerging web of linked data. IEEE Intelligent
     Systems, 24(5):87–92, 2009.
 [2] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker,
     R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point
     for the web of data. Journal of Web Semantics, 7(3):154–165,
     2009.
 [3] D. Brickley and R. Guha. RDF Vocabulary Description Language
     1.0: RDF Schema. W3C Recommendation, World Wide Web Consortium,
     February 2004.
 [4] M. J. Cafarella, A. Y. Halevy, Z. D. Wang, E. Wu, and Y. Zhang.
     WebTables: exploring the power of tables on the web. PVLDB,
     1(1):538–549, 2008.
 [5] A. Cohen, C. Adams, J. Davis, C. Yu, P. Yu, W. Meng, L. Duggan,
     M. McDonagh, and N. Smalheiser. Evidence-based medicine, the
     essential role of systematic reviews, and the need for
     automated text mining tools. In Proc. 1st ACM Int. Health
     Informatics Symposium, pages 376–380. ACM, 2010.
 [6] D. W. Embley, D. P. Lopresti, and G. Nagy. Notes on
     contemporary table recognition. In Document Analysis Systems,
     pages 164–175, 2006.
 [7] M. Hurst. Towards a theory of tables. IJDAR, 8(2-3):123–131,
     2006.
 [8] D. Koller and N. Friedman. Probabilistic Graphical Models:
     Principles and Techniques. MIT Press, 2009.
 [9] G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and
     searching web tables using entities, types and relationships.
     In Proc. 36th Int’l Conference on Very Large Databases, 2010.
[10] V. Mulwad, T. Finin, Z. Syed, and A. Joshi. Using linked data
     to interpret tables. In Proc. 1st Int. Workshop on Consuming
     Linked Data, Nov. 2010.
[11] D. Sackett, W. Rosenberg, J. Gray, R. Haynes, and
     W. Richardson. Evidence based medicine: what it is and what it
     isn’t. BMJ, 312(7023):71, 1996.
[12] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: A Core of
     Semantic Knowledge. In 16th Int. World Wide Web Conf., New
     York, 2007. ACM Press.
[13] Z. Syed and T. Finin. Creating and Exploiting a Hybrid
     Knowledge Base for Linked Data. Revised Selected Papers Series:
     Communications in Computer and Information Science. Springer,
     April 2011.
[14] Z. Syed, T. Finin, V. Mulwad, and A. Joshi. Exploiting a Web of
     Semantic Data for Interpreting Tables. In Proc. 2nd Web Science
     Conf., April 2010.
[15] P. Venetis, A. Halevy, J. Madhavan, M. Pasca, W. Shen, F. Wu,
     G. Miao, and C. Wu. Recovering semantics of tables on the web.
     In Proc. 37th Int. Conf. on Very Large Databases, 2011.
[16] J. Wang, B. Shao, H. Wang, and K. Q. Zhu. Understanding tables
     on the web. Technical report, Microsoft Research Asia, 2011.
[17] W. Wu, H. Li, H. Wang, and K. Zhu. Towards a probabilistic
     taxonomy of many concepts. Technical report, Microsoft Research
     Asia, 2011.
[18] R. Zagari, G. Bianchi-Porro, R. Fiocca, G. Gasbarrini, E. Roda,
     and F. Bazzoli. Comparison of 1 and 2 weeks of omeprazole,
     amoxicillin and clarithromycin treatment for Helicobacter
     pylori eradication: the HYPER study. Gut, 56(4):475, 2007.