<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ORCID:</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Towards Semantic Exploration of Tables in Scientific Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Varish Mulwad</string-name>
          <email>varish.mulwad@ge.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vijay S. Kumar</string-name>
          <email>v.kumar1@ge.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jenny Weisenberg Williams</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tim Finin</string-name>
          <email>finin@umbc.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sharad Dixit</string-name>
          <email>sharad.dixit@ge.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>GE Research</institution>
          ,
          <addr-line>1 Research Circle, Niskayuna, NY</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>GE Research, John F. Welch Technology Center</institution>
          ,
          <addr-line>Whitefield, Bengaluru</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Scientific Tables</institution>
          ,
          <addr-line>Table Characterization, Semantic Table Interpretation, Knowledge Graphs</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Maryland Baltimore County</institution>
          ,
          <addr-line>1000 Hilltop Circle, Baltimore, MD</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Structured data artifacts such as tables are widely used in scientific literature to organize and concisely communicate important statistical information. Discovering relevant information in these tables remains a significant challenge owing to their structural heterogeneity, dense and often implicit semantics, and diffuse context. This paper describes how we leverage semantic technologies to enable technical experts to search and explore tabular data embedded within scientific documents. We present a system for the on-demand construction of knowledge graphs representing scientific tables (drawn from online scholarly articles hosted by PubMed Central), and for synthesizing tabular responses to semantic search requests against such graphs. We discuss key differentiators in our overall approach, including a two-stage semantic table interpretation that relies on an extensive structural and syntactic characterization of scientific tables, and a prototype knowledge discovery engine that uses automatically-inferred semantics of scientific tables to serve search requests by potentially fusing information from multiple tables on the fly. We evaluate our system on a real-world dataset of approximately Semantic Search Proceedings of the ESWC 2023 Workshops and Tutorials; 1st International Workshop on SemTech4STLD, May 28, 2023, Hersonissos, Greece</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The discovery of relevant information from scientific documents has traditionally been critical to
the advancement of our understanding of real-world phenomena. The pace of research in dynamic,
fastevolving scenarios, as exemplified by COVID-19, can lead to unprecedented rates in publication of
scholarly literature on a given subject [1]. Furthermore, recent advances in generative AI and foundation
models have made it radically simpler and more economical for the scientific community to produce
and publish new technical content [2]. These emerging trends have necessitated more machine-driven,
human-interpretable approaches to scientific knowledge discovery. Open datasets like CORD-19 [3]
have motivated novel techniques to enable scientists to comprehend existing literature, including tools
for keyword/semantic search and Q&amp;A, recommendation, and summarization of scientific documents.
Newer models [4] and tools like ChatPDF [5] and Elicit [6] foster dialog-driven information discovery
from documents within various domains. However, as with web documents, the exploration of scientific
literature is predominantly associated with searching over its unstructured textual content.</p>
      <p>Besides text, alternative modalities such as tables and charts play a significant role in how the
scientific community conveys descriptive information in the literature. Our experience with assembling
a corpus of over 62,000 open-access coronavirus-related articles from PubMed Central [7] yielded over
120,000 tables underlining a wealth of latent knowledge embedded within such structured data artifacts.
Extraction and retrieval of relevant information from these scientific tables is becoming increasingly</p>
      <p>2023 Copyright for this paper by its authors.
critical to emerging knowledge-driven applications. For example, consider a genomic surveillance
scenario seeking information on treatment efficacies against the top prevalent COVID-19 variants in
each US state. Better responses to such queries may entail going beyond text and searching relevant
portions of or entire scientific tables for vital knowledge nuggets, possibly fusing information from
multiple source tables on the fly.</p>
      <p>Viewing scientific documents as a collection of content-rich tables and descriptive context, this
paper addresses the need for technical experts to search and systematically explore scientific tables in
their quest to discover new knowledge. In comparison to traditional web tables and open relational data,
tables found in scientific documents are generally (i) smaller in size, (ii) consist of cells with more
idiomatic strings or numerical data than text and (iii) contain specialized textual entities specific to a
scientific domain (see Figure 1). Moreover, unlike their relational/web table counterparts, the nature of
scientific tables reflects the more general circumstances of their creation: When researchers prepare
scientific manuscripts, their contents are typically subject to constraints in terms of “publication real
estate” (e.g., maximum page count for a paper). Consequently, especially with any tables present in
these articles, researchers resort to numerous information compaction practices to optimize for human
consumption and visual perception of information in the tables while ensuring that there is no
ambiguity, information overload or other loss in informational value associated with these tables.
Our work is motivated by the resulting distinctive characteristics of scientific tables listed below:
• Structural heterogeneity: In the absence of universal guidelines, tables in scientific documents
often come in a variety of structures and dimensions. To better fit tabular data within the available
space, researchers may (i) transform a simple table with only column headers into a more complex
one with row headers or both sets of headers, (ii) merge neighboring cells with repeated values into
one, or (iii) group logically related neighboring cells and extract any common information into
higher-level parent cells—potentially resulting in a hierarchy of sub-rows and/or sub-columns.
• Dense and often implicit semantics: Due to space constraints, researchers may (i) use abridged
table header names whose contextual description may explicitly capture dense semantics—e.g., a
header cell mort. rt. to signify mortality rate could either represent deaths per 1000 of infected
population or deaths per million of the total population, (ii) coalesce numerical quantities and their
associated units within the same table cells, or (iii) use headers where context is implicit, i.e., the
meaning of a header cell (e.g., ‘location’), and hence of an associated column or row, can be
ascertained only by inspecting the content of data cells in that column or row.
• Diffuse Context: Tables in scientific papers typically represent a summary view or a meta-analysis
over a more comprehensive set of information. (i) Contextual description of cell content may be
present in table captions or other referring text outside of the tables, (ii) Cells may include pointers
to other related tables instead of actual tabular data themselves. From a discovery viewpoint, it is
likely that information needed to put together a comprehensive response to a search request is
scattered across multiple tables and sources, potentially drawn from different technical documents.</p>
      <p>The very practices seeking to ease human understanding of published scientific tables, thus, also
introduce challenges that make it harder for machine-driven table understanding and discovery. As part
of our approach to address challenges posed by structural heterogeneity and domain-specific entities
(e.g., gene sequences, any measurements/units that hold a special meaning within a scientific domain),
we model scientific tables as linked data objects and construct knowledge graphs (KG) using
information extracted from these tables and metadata from their respective containing articles. In this
paper, we present a KG-creation system that employs a two-stage table interpretation methodology. For
each table, we first perform both an extensive characterization of the table structure as well as a syntactic
understanding of its individual cell contents. Next, semantic table interpretation, focusing on column
type annotation (CTA) and cell entity annotation (CEA), leverages the results of the first stage to map
table cells to nodes in an existing reference KG (e.g., Wikidata [8]). Semantic interpretation of web
tables is widely understood – While we build on our practical solution adapted to scientific tables in the
biomedical domain [9], it is not our key contribution. Our overall approach for scientific tables can
leverage existing alternative web table interpretation systems with appropriate adaptations. For
exploring our scientific tables-centric-KG, we additionally present a hybrid table-based semantic search
paradigm to support table-matching based on inferred semantics under various contextual constraints.</p>
      <p>
        Our specific contributions described in this paper include: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) a dataset collection system to harvest
tables and provenance metadata from online open-access PubMed Central articles, (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) a preliminary
prototype system to generate a KG of scientific tables that includes a) annotation of the collected tables
along multiple dimensions using automated rule-based structural characterization and syntactic parsing
techniques, and b) knowledge graph embeddings-based approach to semantic table interpretation, (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) a
promising preliminary evaluation over a subset of annotated data and a thorough characterization of
over 120,000 scientific tables that demonstrate high structural heterogeneity, (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ) a pipeline to populate
RDF triples representing semantic annotations of table cells and table provenance in a KG, and (
        <xref ref-type="bibr" rid="ref5">5</xref>
        ) a
prototype discovery engine that leverages the inferred table semantics captured in the KG to support
discovery of tabular data by potentially merging semantically compatible tables on the fly. This paper
presents our first steps towards populating and exploring a KG of scientific tables. We are in the process
of open sourcing our current technology for the community to reproduce and build on our work2.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Web Tables and Dataset Search: Given their ubiquity in communicating information, an extensive
body of prior research by the information retrieval and database communities covers the processing and
understanding of tables, particularly around inferring the intended meaning and semantics of both web
tables and Open Data for dataset search and discovery purposes [10][11][12][13]. Some approaches
allow table-based search and data enrichment where queries take the form of an input table—schema
matching is then performed to discover ‘unionable’ or ‘joinable’ tables that are ranked based on some
similarity criteria [12] to generate on-the-fly result tables [14]. However, existing approaches do not
specifically address the idiosyncrasies of tables found in scientific and technical documents.
Table Representation Learning: Recently, the use of large-scale, pretrained table representation
learning (TRL) models [15][16] for understanding and exploring tabular information has been gaining
traction across communities, with a few efforts even specifically addressing understanding of tables
found in scientific documents. Challenge datasets like PubTables-1M [17], ChemTables [18],
ArxivPapers [19], and SciGen [20]—all assembled from different forms of scientific/technical
literature—are fostering innovation in table detection, classification, extraction, context-aware
tableto-text generation, and question-answering respectively. Complementary to these, and from a discovery
perspective, our work targets on-the-fly synthesis of tabular knowledge from existing scientific tables.
Semantic Table Interpretation: The SemTab challenge [21] annually benchmarks semantic table
interpretation systems with a focus on relational web tables (scientific tables from biomedical papers
were included in recent editions). SemTab systems aim to map table data to nodes in a KG such as
Wikidata or DBpedia. Specifically, they map a column header string to a KG class (Column Type
2 As of submission, our code to assemble scientific tables from PMC articles on-demand is available under:
https://github.com/orgs/geknowledge-discovery/repositories. Code for the remaining systems will be made available here pending code-release approvals.
Annotation (CTA)), data cell string to a KG entity (Cell Entity Annotation (CEA)), and relationship
between pair of columns to a KG property (Column Property Annotation (CPA)). Existing systems
[22][23][24] could potentially be adapted for CTA and CEA in scientific tables to replace our current
approach. Beyond SemTab, semantic interpretation of scientific tables has seen limited research—[25]
leverages the MetaMap tool to annotate cells in PMC article tables with UMLS concepts.
QTLTableMiner++ [26] annotates tables in Europe PMC articles with concepts from Phenotype, Plant,
and other quantitative trait locus-specific ontologies. Our work goes further combining structural and
syntactical characterizations with a table’s semantic interpretation (via CTA and CEA) to generate a
more comprehensive semantic representation to support table search including on-the-fly data fusion.
Scientific Table Characterization: Existing techniques to formally characterize scientific literature
[27] cover many aspects of scholarly knowledge but are limited in their representation of tables found
in these documents. [28] presents a taxonomy for web tables, categorizing them into “relational
knowledge” tables (those from which knowledge can be extracted) and “layout” tables (no knowledge
to extract). Another taxonomy [29], henceforth referred to as the Lautert taxonomy, classifies web tables
primarily into five mutually exclusive categories (horizontal, vertical, matrix, formatting, navigational),
mirroring [28], but omits subcategories that are unlikely to be encountered in scientific documents (e.g.,
form, calendar), and uses more common names (e.g., “horizontal table” instead of “vertical listing
table”). The secondary classification includes five non-mutually exclusive categories (concise, nested,
multivalued simple, multivalued complex, splitted), each indicating a different methodology to
structurally format and present tabular content. Fang et al. [30] introduce additional labels (e.g.,
multilevel header, multi-dimensional table) that are applicable to scientific tables but are not adequately
covered in Lautert's scheme. To our knowledge, only Milosevic et al. [31][32] have developed a system
for structural characterization of tables in PMC articles. Their rule-based system detects and
distinguishes between header and data cells, detects basic data types for cells (e.g., strings, numbers),
and classifies a table into one of three types. Our proposed approach, in comparison, can generate
characterization at different levels (cell, row, column, table), generating values for 18 different labels.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset Collection</title>
      <p>We developed an extensible, reusable data collection pipeline to download, parse, and extract a
dataset of tables and associated metadata from COVID-19-related PMC manuscripts. Inspired by the
CORD-19 dataset [3], we instantiate our pipeline with the following query search-string: (“COVID”
OR “COVID-19” OR “Coronavirus” OR “Corona virus” OR “2019-nCoV”). As of 06/26/2021, this
query resulted in over 200,000 matching PMC articles, of which 62,777 articles were present within the
PMC open-access, commercial-use collection subset. Further, of these, 62,746 articles contained
corresponding full-text encodings in XML format. Upon parsing these XML artifacts, our pipeline was
able to extract a total of 120,417 HTML-formatted tables from across the article corpus.</p>
    </sec>
    <sec id="sec-4">
      <title>4. A Knowledge Graph of Scientific Tables</title>
      <p>Figure 2 shows an overview of our system to semantically interpret tables discovered in scientific
literature. The HTML tables extracted by the dataset collection pipeline are first processed by a Table
Characterization module that infers the different structural and syntactic characteristics at the cell,
row, column, and table level, including the basic datatypes (e.g., numbers, strings) of table cells. The</p>
      <p>The table characterization module leverages several rules to identify different characteristics at the
cell, header, row, and table level. We first describe the taxonomy used to identify these characteristics
followed by our implementation details. To categorize tables, we start with the Lautert taxonomy [29]
and supplement it with additional classifications from [30]. Our characterization also includes
additional labels and metrics (e.g., number of rows, number of columns, basic data types for table cell
values, semantic types for certain set of commonly encoded data), which are generally useful and/or
needed to derive the Lautert [29] and Fang [30] classifications.
We identified and selected several
labels that can be applied at the
cell, row, column, header, and
table level. At the table level, a
main classification label can be
one of Horizontal, Vertical,
Matrix, or Multi-dimensional.</p>
      <p>The header level characterization
labels include Header Row,
Header Column, Header Row
Concise, Header Row Multilevel,
and Headers Splitted. The table
body (minus header cells) is
characterized to be concise or not
(Body Concise). Cell level labels
include checks for whether a cell is
multi-valued (further propagated
up to the table level) and also
several basic data types like string,
number, number with range, etc.</p>
      <p>
        Implementation: We implement a rule–based table characterization module which assigns 18 labels
applied at the row, column, or table label, as seen in Table 1. Internally, the algorithm may apply some
of these labels at the cell level as well. Labels may have one of three types of values: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) number, e.g.,
number of columns in a table (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) Boolean, e.g., whether a table has multilevel header rows (
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
enumeration, e.g., a given row’s data type. Some labels indicate that a table (or row/column) contains
a feature (e.g., HAS_COLSPAN=True), whereas some labels indicate that the table (or row/column) is
comprised entirely of the given feature (e.g., CELLS_DATATYPE=“number”). The implementation
generates labels using an incremental approach largely leveraging the HTML markup of the tables. In
general, rows and columns are characterized based upon inspection of their component cells. Likewise,
tables are characterized based upon inspection of their component rows and columns. The
characterization module follows a strict order driven by logical dependencies—it cannot determine if a
table is horizontal before characterizing the datatype of its columns; it will not characterize columns
before determining that a table is non-concise. Such dependencies drive how labels are calculated.
      </p>
      <p>An illustrative example follows, showing a subset of the module’s logic leading up to classifying
Table 1 from article PMC5923605 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5923605) as a
horizontal table (meaning that each row contains information for an entity). From the table’s HTML
markup, the module determines that it contains two body rows, and that for each row, no
CELL_DATATYPE can be assigned since it contains a mix of strings, numbers, and numbers with
tolerances. Further, the module determines that no body rows contain an HTML element
(rowspan/colspan) causing a cell to exceed the height or width of a single row or column, and therefore
classifies the table as BODY_CONCISE=false (does not contain merged cells). This classification
indicates that the table is a regular grid, and therefore the module can characterize its columns. It
determines that there are five columns, and that each has a consistent datatype (two columns are
CELL_DATATYPE=“string”, two columns are CELL_DATATYPE=“number”, and one column is
CELL_DATATYPE=“number with tolerance”). Note that if the column was mostly a certain datatype
but contained cells indicating missing values (whitespace, dash, or “NA”), then it will still correctly
assign the datatype. Finally, the module determines that since the table contains at least two columns
with distinct and differing datatypes (omitting the first column, which is often a header), it assigns
MAIN_CLASSIFICATION=“horizontal”. In a horizontal table, each row represents an entity, and each
column represents a characteristic of that entity. In this example table, the entities are the samples, and
the columns contain phases, particle sizes, densities, etc. of the sample.
4.2.</p>
    </sec>
    <sec id="sec-5">
      <title>Syntactic Parsing &amp; Table Flattening</title>
      <p>Once a table’s structural and syntactic characteristics have been identified, we further process it to
simplify the semantic interpretation step. This includes identifying commonly encoded data in tables
and separating the core (query/entity) string from its contextual metadata. The former assists in
identifying a subset of the table (cells/rows/columns) that should not be semantically annotated while
the latter produces string content per cell for direct use by the semantic interpretation module.
Detecting &amp; Labelling Commonly Encoded Data: Accurately interpreting table semantics requires
not only linking strings to appropriate concepts/entities from a reference KG, but also not linking
strings/numbers that represent literals. While literals found in scientific tables can be classified into a
few basic types (e.g., strings, numbers, Boolean values, dates), it is useful to classify them into more
specific types. For tables in the biomedical domain, examples of such specific types include DNA/RNA
sequences, references, and clinical trial IDs. Similar to [33], we develop a specialist-based framework
for detecting commonly encoded data in tables. A “specialist”—either a regular
expression/patternbased, dictionary-based, or machine learning approach, independently assesses cells in a column/row
for the likelihood that the column belongs to a specific type that it is designed for. For example, there
could be independent specialists to detect zip codes, SSNs, and sequences. We implement specialists to
detect two basic types and three semantic types (Table 2). These were identified by inspecting common
entries in the tables in our dataset. Cells/rows/columns identified as one of these types do not require
any additional semantic interpretation. The syntactic parsing module uses each specialist to determine
the basic and/or semantic type for each cell. Further, it propagates this value to the column/row (header)
if all cells in a column/row have the same type (ignoring cells with empty entries, “NA”, or dashes).
Table Flattening: A final step before semantic interpretation involves normalizing or flattening a
subset of tables with “complex” structures. We leverage the inferred table characterizations to identify
tables with a single simple header row and concise body containing only row spans. We flatten such
tables using specialized data libraries (e.g., pandas) and re-export table HTML filling missing entries.</p>
      <p>We implement our table interpretation module for scientific literature with a focus on CTA and CEA
[21], mapping header and data cell strings to Wikidata [8] items (i.e., concepts and entities). At the
core of our module is a scalable entity linker designed and optimized for the biomedical domain [9].
Additionally, our module includes a joint inference component, which, wherever applicable,
collectively assigns entities to header and all data cells in a column.</p>
      <p>Core Entity Linker: We developed a practical, scalable entity linker that takes a mention string
(from a table cell) and retrieves ranked lists of top k Wikidata items that it deems to be best matches to
that mention string. In [9], we describe how we customized this linker for the biomedical domain and
optimized the linking process for hundreds of thousands of table cells by setting up a localized instance
of the Wikidata knowledge graph—specifically, indexing 95.8M items and the type hierarchy
comprising 2.6M types. We use the ‘sitelinks’ count for each item as a proxy for its popularity.</p>
      <p>
        Joint Inference: The core entity linker uses limited context while generating and ranking candidate
entities for table cell strings. In a well-formed (simple/flattened) table, there are several
relations/interactions between its constituent header and data cells, which can be leveraged to
disambiguate entity assignments for table cells. These include (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) semantic type of headers influencing
the entities assigned to data cells in a column and vice-versa; (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) semantic type of a header influencing
and being influenced by the semantic types of adjoining headers; and (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) entity assignment of a data
cells influencing and being influenced by the assignments of adjoining data cells in a row [24]. We
implement a joint inference component specifically focusing on collectively assigning entities to cells
in a table column. At the core of this component is a KG embeddings-based “agreement function” to
compute compatibility between the entities assigned to data cells in a column. KG embeddings [23] are
vector representations of nodes in a knowledge graph. We leverage existing pre-trained embeddings for
Wikidata provided via Wembedder [34]. It provides both pre-trained models as well as REST APIs for
accessing embeddings for Wikidata items and properties.
      </p>
      <p>The algorithm starts with the ranked list of candidates retrieved by the core linker for each data cell
in a column. For each candidate, we use the Wembedder API to retrieve its corresponding embedding
vector (if available). Once the vectors for all the candidates for all cells are retrieved, we perform
Kmeans clustering over the collection of vectors, producing a cluster assignment prediction for each
candidate vector. At this point, each cell is associated with an ordered list of candidate entities with
corresponding vectors and cluster IDs. The algorithm identifies the cluster ID (“mode cluster ID”) most
frequently associated with the top-ranked vectors across the cells. For each cell, the algorithm selects
the highest-ranked candidate entity that is assigned to the mode cluster ID. If no candidate entities are
assigned to the mode cluster ID, then the algorithm picks the first candidate entity for which a vector
was available. If none of the candidate entities have associated vectors, then it selects the first-ranked
candidate. The K-means clustering algorithm requires a key input indicating the target number of
clusters. The joint inference algorithm cannot be applied to all tables - it works best when our system
is able to identify a column in a flattened table. In cases where joint inference is not applicable, the
module chooses the top-ranked candidate returned by the core linker.</p>
    </sec>
    <sec id="sec-6">
      <title>Evaluations</title>
      <p>The column “# Anno.” indicates the number of tables annotated for use in evaluating the
characterization accuracy for each label. Precision (“Pr.”) for all labels is relatively high, indicating that
if the system generates a label, it is likely to be correct. We observed lower recall (“Re.”) for labels
related to concise body, header columns and main classification (related to table orientation), indicating
areas where the system fails to detect an expected label. It is often challenging to detect a concise table
body because a table’s appearance often does not match its underlying formatting (specifically, a cell
that appears to be merged is often formatted as a set of cells, some of which are empty). The system
often misses identifying horizontal and vertical tables if it has no way to differentiate column or row
data types (e.g., if every column is a number), or if it cannot characterize columns due to the table being
concise. It is challenging to identify header columns if the algorithm cannot detect anything different
about the header column (e.g., all bold, or the only string column in the table).</p>
      <p>Syntactic Parsing and Semantic Interpretation Evaluations: We manually annotated 47 tables
drawn at random from 45 PMC articles. Each table cell was mapped to one of: Wikidata item/property
(henceforth referred to as Wikidata), Literal, Reference, Clinical Trial IDs, Index (header cells
representing index columns in a table), or NA (indicating that it should be mapped to a concept/entity,
but none exists in the reference KG). NA was also used in a small fraction of cases where the table cell
composed of multiple entities (e.g., “Maine, Ohio, California” as single cell value). A total of 3600
table cells were annotated. Unlike web tables where most cells represent entities, in our dataset, a
majority of the table cells (~70%) are literals, followed by Wikidata annotations (~25%) as the
secondmost frequent type. We use this annotated data to evaluate the syntactic parsing and semantic
interpretation modules’ capabilities. Joint inference was evaluated using Wembedder’s pre-trained
model [34] over a June 2017 snapshot of Wikidata, NUM_CLUSTERS parameter set to 4, and # of
initial candidates retrieved by the core entity linker set to 25.</p>
      <p>Table 4 shows the breakdown for precision (Pr.), recall (Re.), and F-score for the different types
predicted by the system. We skip reporting on NA and index as they are not supported by our system
yet. Rule-based methods to detect basic types, and specialist-identified semantic types achieve high
precision for detecting Literals, Clinical Trial IDs, and References. Though, as is typical with
rulebased methods, recall gets impacted if patterns do not have sufficient coverage (as evidenced in the case
of Trial ID). The numbers for the semantic interpretation module to produce accurate Wikidata links
appear lower on the surface.</p>
    </sec>
    <sec id="sec-7">
      <title>Triple Generation and Knowledge Graph Construction</title>
      <p>To model tables in scientific documents and their annotations, we developed an ontology to represent
metadata from PMC documents along with the data extracted from their constituent tables and their
inferred semantic meaning. This ontology builds upon a model [35] which, in turn, extends a subset of
the standard W3C PROV ontology. We subclass the core ENTITY class from PROV and extend it to
represent PMC documents and tables. The Triple Generation module takes the output from the Table
Characterization and Semantic Interpretation modules and represents the data as RDF triples grounded
in our ontology. Specifically, it includes a pipeline that auto-generates triples as follows: First, it creates
triples at the document level (e.g., title) followed by creation and linking of the document to instances
representing its authors, publishers, and sponsors. Next, it creates triples for tables in the document,
starting at the table level (e.g., number of rows) down to the cell level (e.g., parsing minimum and
maximum numeric values from a cell containing a range). It then adds triples for the Wikidata
assignment for cells as returned by the Semantic Interpretation module. The number of triples generated
per document varies widely according to its contents but is on the order of 1000 per PMC article. These
triples are persisted in a knowledge graph for table discovery purposes.</p>
    </sec>
    <sec id="sec-8">
      <title>5. Search and Discovery of Scientific Tables</title>
      <p>We developed a framework for enhanced search and discovery of relevant information across
scientific tables captured in a knowledge graph, and for organizing this information in the form of
tabular responses. This section presents a preliminary evaluation of a prototype discovery engine that
uses the inferred semantics of tables to serve table-based semantic search requests.</p>
      <p>Hybrid Table-based Semantic Search: When searching over tables in scientific documents, a
meaningful response is often predicated on greater query expressivity and semantic resolution of user
requests than can be provided by a simple keyword search. To this end, we provide a hybrid search
experience that supplements keyword search with the ability for users to figuratively ‘sketch’ their
desired (relational) result table—including specifying semantic markups of its header/data cells, along
with constraints around its contextual metadata. This ensuing table-based semantic search request is
systematically translated into queries to retrieve appropriate data from the knowledge graph.</p>
      <p>We first parse any table-based discovery request into an intermediate query plan where both tables
and operations on tables are first-class citizens. Inspired by relational database systems, our query plans
are composed of primitive operators (SELECT, FILTER, RANK, and FUSE) to process scientific tables
expressed as linked data objects in a knowledge graph. The SELECT operator simply returns a list of
identifiers for all tables in our graph that match with the discovery request. The FILTER operator
prunes this list by applying one or more contextual constraints on the tables. RANK orthogonally
reorders the list of tables based on some ranking criteria.</p>
      <p>Discovery Engine and Semantics-guided
Table Discovery: We built a prototype
discovery engine that takes as input an
intermediate query plan and incrementally
constructs and executes appropriate SPARQL
queries against a knowledge graph of scientific
tables. This is accomplished by adding or
modifying ad hoc query clauses corresponding
to each operator instance in a query plan. This
engine is also responsible for taking the results
of SPARQL query execution and packaging
them in the form of relational result tables for
easy consumption.</p>
      <p>Discovery of relevant tabular information involves matching the semantics of a list of search terms
from a user request against the appropriate inferred semantics of tables and filtering the matches to
include only those tables that meet some similarity criteria. Consider a simplified search request around
latest findings on effective treatments for COVID-19 (see Figure 3). The matching algorithm underlying
the SELECT operator takes the semantics (in this example, Wikidata identifiers or QIDs) associated
with header cells in the table-based search request and compares them against those associated with
header cells for each table in the graph. If the two sets of QIDs overlap completely, or if the set overlap
exceeds a threshold as established by some query constraint (e.g., min_coverage=2), then the
candidate table is included in the search results. By default, RANK sorts the list of candidates based on
this header-cell coverage metric (i.e., tables with maximum set overlap are ranked higher).</p>
      <p>Evaluation: We evaluated the ability of our discovery engine to return a ranked list of relevant tables
corresponding to a search request – here, a request comprises a desired list of entities that a result table
header is expected to contain, and relevancy is based on a semantic match between this list and the
header cells of tables. Specifically, we carried out a preliminary evaluation against a knowledge graph
comprising a subset of 118 tables drawn at random from our full set of tables, and manually examined
for relevance to a query set. Our query set includes 5 queries motivated by realistic scenarios, where
each query is a list of QIDs corresponding to three desired entities. We compute Average Precision
(AP): the mean of precision scores at each point where a relevant result is returned, and Recall: the ratio
of relevant results to the expected number of relevant results based on a manual search. We also
compute Mean Average Precision (MAP) scores across all queries.</p>
      <p>As seen in Table 5, overall precision scores are relatively low across all queries, whereas all queries
demonstrate high recall. We are exploring hybrid approaches to better match against header cells,
including going beyond strict QID matches to also compare against a hierarchy of an entities’
supertypes, and semantic similarity based on pre-trained Wikidata embeddings from Wembedder [34]. With
scientific tables, the ideal response to a search request is one that merges relevant information from
tables potentially drawn from multiple documents. To this end, our engine also supports a FUSE
operator that can union the rows of two or more semantically-compatible candidate tables on the fly
and generate a fused table in response to the request (details beyond the scope of this paper).</p>
    </sec>
    <sec id="sec-9">
      <title>6. Conclusions</title>
      <p>The discovery of relevant information from scientific documents, till date, has predominantly
focused on unstructured text, ignoring crucial information encoded in structured artifacts such as tables.
In this paper, we demonstrated preliminary prototype systems for populating and searching over a
knowledge graph of scientific tables, enabling semantics-guided discovery from these structured
artifacts. Our two-stage table interpretation methodology highlighted the vast structural and syntactical
heterogeneity prevalent in scientific tables by characterizing over 120,000 tables from PMC articles.
Preliminary evaluations of the characterization and semantic interpretation modules on a subset of these
tables show promising results. While the characterization is a high precision module, it’s recall could
be improved for certain labels by including additional rules or exploring the development of supervised
algorithms. Semantic interpretation and disambiguation performance is good when our system is able
to discover the appropriate entity in its ranked candidate result set but fails otherwise. Candidate search
needs to be further adapted to the domain strings’ idiomatic nature. Entity linking systems have
traditionally focused on linking strings to nodes in a knowledge graph, but in scientific tables—where
a majority of the cells are literals—they also need to detect and learn to not link literals. Distinguishing
between a literal and an entity string will be a key driver for this. Our preliminary evaluation of table
discovery, based solely on strict header-based semantic matching, showed low average precision but
high recall. Enhanced semantic matching of tabular content and the system’s ability to fuse knowledge
from multiple tables on the fly will drive a successful discovery experience over scientific documents
in the future. In this paper, we believe we have taken initial steps towards developing a scalable and
practical scientific table-centric knowledge graph population and search system.</p>
    </sec>
    <sec id="sec-10">
      <title>7. Acknowledgements</title>
      <p>This research is based on work supported in part by the Office of the Director of National Intelligence
(ODNI), Intelligence Advanced Research Projects Activity (IARPA), via [2021-21022600004]. The
views and conclusions contained herein are those of the authors and should not be interpreted as
necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S.
Government.</p>
    </sec>
    <sec id="sec-11">
      <title>8. References</title>
      <p>[7] PMC Open Access Subset, Bethesda, MD: National Library of Medicine, 2003.</p>
      <p>https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/ (last accessed: April 2023)
[8] Denny Vrandecic and Markus Krotzsch, Wikidata: A Free Collaborative Knowledgebase,</p>
      <p>Communications of the ACM, Vol. 57(10), pp. 78-85, 2014. doi: 10.1145/2629489.
[9] V. Mulwad, et al., A Practical Entity Linking System for Tables in Scientific Literature, in:</p>
      <p>Proceedings of the 3rd AAAI Workshop on Scientific Document Understanding, 2023. To appear.
[10] M. Cafarella, A. Halevy, H. Lee, J. Madhavan, C. Yu, D. Wang, E. Wu, Ten Years of WebTables,
in: Proceedings of VLDB Endowment, 11(12): 2140-2149, 2018. doi: 10.14778/3229863.3240492
[11] Shuo Zhang and Krisztian Balog, Web Table Extraction, Retrieval and Augmentation: A Survey,</p>
      <p>
        ACM Trans. on Intelligent Systems and Technology, 11(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ): 1-35, 2020. doi: 10.1145/3372117.
[12] A. Chapman, E. Simperl, L. Koesten, G. Konstantinidis, L. Ibanez, E. Kacprzak, P. Groth, Dataset
      </p>
      <p>Search: A Survey, The VLDB Journal, 29, pp. 251-272, 2020. doi: 10.1007/s00778-019-00564-x
[13] N. Noy, et al., Google Dataset Search: Building a Search Engine for Datasets in an Open Web</p>
      <p>Ecosystem, in: Proceedings of the 28th World Wide Web Conference, pp: 1365-75, 2019.
[14] S. Zhang and K. Balog, On-the-fly Table Generation, in: Proceedings of SIGIR, pp: 595-604, 2018.
[15] G. Badaro, M. Saeed and P. Papotti, Transformers for Tabular Data Representation: A Survey of</p>
      <p>Models and Applications, Eurecom Technical Report 6721, 2021.
[16] A Comprehensive Paper List of Reasoning over Tables, 2022, URL:
https://github.com/yilunzhao/Awsome-Table-Reasoning (last accessed: April 2023)
[17] B. Smock, R. Pesala, R. Abraham, PubTables-1M: Towards Comprehensive Table Extraction from</p>
      <p>Unstructured Documents, in: Proceedings of CVPR, 2022. doi: 10.1109/CVPR52688.2022.00459.
[18] Z. Zhai, et. al., ChemTables: A Dataset for Semantic Classification on Tables in Chemical Patents,</p>
      <p>Journal of Cheminformatics, 13(97), 2021. doi: 10.1186/s13321-021-00568-2.
[19] M. Kardas, et al., AxCell: Automatic Extraction of Results from Machine Learning Papers, in:</p>
      <p>Proceedings of EMNLP, pp. 8580-8594, 2020. doi: 10.18653/v1/2020.emnlp-main.692.
[20] N. S. Moosavi, et al., SciGen: A Dataset for Reasoning-Aware Text Generation from Scientific</p>
      <p>Tables, in Proceedings of 35th Conference on NeurIPS, Track on Datasets and Benchmarks, 2021.
[21] N. Abdelmageed, et al., Results of SemTab 2022, in: Vol. 3320 of CEUR Workshop Proceedings
—Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab), 2022.
[22] J. Liu, et al., From tabular data to knowledge graphs: A survey of semantic table interpretation
tasks and methods, Journal of Web Semantics, Vol. 76, 2023. doi: 10.1016/j.websem.2022.100761
[23] J. Liu, et al., Radar Station: Using KG Embeddings for Semantic Table Interpretation and Entity</p>
      <p>Disambiguation, in Proceedings of ISWC, pp. 498-515, 2022. doi:10.1007/978-3-031-19433-7_29
[24] V. Mulwad, T. Finin, A. Joshi, Semantic message passing for generating linked data from tables,
in: Proceedings of ISWC 2013, LNCS vol. 8218, doi: 10.1007/978-3-642-41335-3_23.
[25] N. Milosevic, et al.: A framework for information extraction from tables in biomedical literature.</p>
      <p>
        Intl. Journal on Document Analysis and Recognition (IJDAR) 22(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ), 55–78, 2019.
[26] G. Singh, et al.: Qtltableminer++: semantic mining of qtl tables in scientific articles, BMC
      </p>
      <p>Bioinformatics, 19(183), 1–11, 2018. doi: 10.1186/s12859-018-2165-7.
[27] O. Corcho, A review of ontologies for describing scholarly and scientific documents, 2014.
[28] E. Crestan, P. Pantel, Web-scale table census and classification, in Proceedings of 4th ACM Intl.</p>
      <p>Conf. on WSDM. pp. 545–554, 2011. doi: 10.1145/1935826.1935904.
[29] L. R. Lautert, M. M. Scheidt, C. F. Dorneles, Web Table Taxonomy and Formalization, ACM</p>
      <p>
        SIGMOD Record 42(
        <xref ref-type="bibr" rid="ref3">3</xref>
        ), 28–33, 2013. doi: 10.1145/2536669.2536674.
[30] J. Fang, et al., Table header detection and classification, in: Proceedings of 26th AAAI Conference
on Artificial Intelligence, 26(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ), pp. 599-605, 2012. doi: 10.1609/aaai.v26i1.8206.
[31] N. Milosevic, et al., Disentangling the Structure of Tables in Scientific Literature, in: Proceedings
of NLDB, LNCS vol. 9612, pp. 162-174, 2016. doi: 10.1007/978-3-319-41754-7_14.
[32] N. Milosevic, et al., Extracting patient data from tables in clinical literature—case study on
extraction of bmi, weight and number of patients, in: HEALTHINF. pp. 223–228, 2016.
[33] P. Nimbalkar, et al., Semantic interpretation of structured log files, in: Proceedings of 17th Intl.
      </p>
      <p>Conf. on Information Reuse and Integration (IRI). pp. 549–555, 2016. doi: 10.1109/IRI.2016.81.
[34] F. A. Nielsen, Wembedder: Wikidata Entity Embedding Web Service, Preprint, arXiv:1710.04099.
[35] Provenance Modeling in RACK, 2021. URL:
https://github.com/ge-highassurance/RACK/wiki/RACK-Data-Model#provenance-modeling-in-rack (last accessed: 04/23)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Holly</given-names>
            <surname>Else</surname>
          </string-name>
          ,
          <article-title>How a Torrent of COVID Science changed Research Publishing-in seven charts</article-title>
          ,
          <source>Nature</source>
          , Vol.
          <volume>588</volume>
          (
          <issue>553</issue>
          ),
          <year>2020</year>
          . doi:
          <volume>10</volume>
          .1038/d41586-020-03564-y.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Mark</given-names>
            <surname>Carrigan</surname>
          </string-name>
          ,
          <source>Generative AI and the Unceasing Acceleration of Academic Writing</source>
          ,
          <year>2023</year>
          . URL: https://blogs.lse.ac.uk/impactofsocialsciences/2023/03/14/generative-ai
          <article-title>-and-the-unceasingacceleration-of-academic-writing/ (last accessed:</article-title>
          <source>April</source>
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Wang. K. Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chandrasekhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Reas</surname>
          </string-name>
          , et al.,
          <source>CORD-19: The COVID-19 Open Research Dataset, in: Proceedings of Workshop on NLP for COVID-19 at ACL Conference</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Luo</surname>
          </string-name>
          , et al.,
          <article-title>BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining</article-title>
          , Briefings in Bioinformatics,
          <volume>23</volume>
          (
          <issue>6</issue>
          ),
          <year>2022</year>
          . doi:
          <volume>10</volume>
          .1093/bib/bbac409.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>ChatPDF-Chat with any</surname>
            <given-names>PDF</given-names>
          </string-name>
          !,
          <year>2023</year>
          , URL: https://www.chatpdf.com/ (last accessed:
          <source>April</source>
          <year>2023</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Ought</surname>
          </string-name>
          , Elicit: The AI Research Assistant,
          <year>2022</year>
          . URL: https://elicit.org (last accessed:
          <source>April</source>
          <year>2023</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>