=Paper=
{{Paper
|id=Vol-1718/paper6
|storemode=property
|title=Extracting Protein-Reaction Information from Tables of Unpredictable Format
and Content in the Molecular Biology Literature 
|pdfUrl=https://ceur-ws.org/Vol-1718/paper6.pdf
|volume=Vol-1718
|authors=Sam Sloate,Vincent Hsiao,Nina Charness,Ethan Lowman,Chris Maxey,Guannan Ren,Nathan Fields,Leora Morgenstern
|dblpUrl=https://dblp.org/rec/conf/ijcai/SloateHCLMRFM16
}}
==Extracting Protein-Reaction Information from Tables of Unpredictable Format and Content in the Molecular Biology Literature==
<pdf width="1500px">https://ceur-ws.org/Vol-1718/paper6.pdf</pdf>
<pre>
      Extracting Protein-Reaction Information from Tables of Unpredictable Format
                    and Content in the Molecular Biology Literature
Sam Sloate, Vincent Hsiao, Nina Charness, Ethan Lowman, Chris Maxey, Guannan Ren, Nathan Fields,
                                      and Leora Morgenstern
                         Corresponding author: leora.morgenstern@leidos.com
                                                Leidos
                                        Arlington, VA 22203

                           Abstract                                common, when they exist they can contain many times the
                                                                   amount of information as text — sometimes thousands or
       Tables in technical papers often provide much use-          tens of thousands of PFs — most of which is not found in
       ful information that is not present in text. This pa-       text. Thus, not reading tables would result in missing large
       per focuses on the specific problem of automated            amounts of PRI. Moreover, tables often give richer informa-
       table reading (ATR): automating the interpreta-             tion than is present in text, e.g., details about the site of a
       tion and extraction of information from protein-            reaction and measurements of increase or decrease of phos-
       reaction information (PRI) tables in the molecular          phates in the substrate.
       biology literature (MBL), in the context of a gen-
                                                                      This paper describes a general approach for automating
       eral knowledge-based approach to automating table
                                                                   the extraction of information for tables based on automating
       reading. We report on the results of our system eval-
                                                                   the mapping of sets of columns of a pre-determined relational
       uation, which demonstrated a precision of greater
                                                                   schema to sets of columns in a table, and our implementation
       than .9 in identifying relevant tables and .8 in map-
                                                                   of this method for automating the extraction of PRI from ta-
       ping tables to correct relational schema.
                                                                   bles. Although reading tables avoids many of the difficulties
                                                                   of natural language processing (NLP), there are many other
  1    Introduction                                                difficulties to be solved, most saliently that
  The work reported in this paper was motivated by and per-        (1) There is no standard schema for representing PRI in ta-
  formed for the DARPA Big Mechanism research project,             bles. Indeed, we have discovered hundreds of schemas in
  which aims to build a large model of human cancer signal-        MBL tables. This means that (2) It is difficult to deter-
  ing pathways that could potentially be used for hypothesiz-      mine whether an MBL table is a PRI table. (3) There is
  ing novel causal relationships that might inform cancer treat-   no standard representation used for specific types of infor-
  ments. An initial step in constructing the model is collect-     mation, e.g., types of phosphorylation and site information.
  ing all human molecular pathway fragments that have been         (4) Much of the information in PRI tables, including the par-
  published in the literature; these fragments would then be as-   ticipants of a reaction (the relevant entities of a pathway frag-
  sembled into larger pathways relative to such considerations     ment), is not explicit in text, but must be derived from other
  as the context of the experiment in which the pathway frag-      sources, including surrounding context. Some of these is-
  ments were observed, and the strength of the evidence.           sues have been noted, though not solved, by previous research
     The largest existing human cancer signaling pathway           to automate table reading [Pimplikar and Sarawagi, 2012;
  database, Pathway Commons [Cerami et al., 2011], con-            Cafarella et al., 2008; 2009]; some arise from the specific
  tains information about tens of thousands of manually curated    domain studied, though as we point out in the conclusion,
  pathway fragments (PFs) and associated protein reactions         analogous problems exist for tables in other highly technical
  (PRs) that are imported from databases like Reactome [Croft      or specific domains.
  et al., 2014]. Pathway Commons is estimated to contain only
  1% of the human PFs and protein reaction information (PRI)       1.1   Scope
  that exist in the molecular biological literature (MBL). Given   Much research on automated table reading (ATR) [Hurst,
  the ever-increasing number of papers published containing in-    2000; Wang and Hu, 2002] has focused on the problem of
  formation about PFs and PRs (more than 10K per year), it is      table detection and on labeling rows and columns, since per-
  likely that the gap between pathway databases like PC and the    forming these tasks is generally a prerequisite to automated
  body of information published in MBL will continue to grow,      table reading. This is not the focus of this paper. All tables
  unless reading these pathways and PRI is to some extent au-      in this study were found in papers retrieved from the PubMed
  tomated.                                                         PMC website at http://www.ncbi.nlm.nih.gov/pmc/, and we
     Although much PRI can be retrieved from the text of MBL       were therefore able to take advantage of the format in which
  papers [Valenzuela-Escarcega et al., 2015], tables are a par-    articles and tables are represented on that site. For articles
  ticularly valuable source of PRI. While PRI tables are not       in that website, we have developed methods for reading both
tables included in the body of an article (in either HTML or
NXML) and tables included in supplementary material (these
are generally in Excel format). Although there are interesting
technical issues which we needed to solve to do this work –
e.g., determining row and column content in tables in which
there are subheaders spanning multiple columns; and deter-
mining table extent, particularly for Excel tables in which
cells adjacent to tables are used to represent other sorts of
information – these are not within this paper’s scope. A sep-
arate document describes these results.

2       Motivating Examples, Technical Challenges
We define the general automated table reading problem as                       Figure 1: Table 1 fragment from PMCID 2962495
follows: Given a target relational schema R(x1 . . . xn) and a
table with columns T (c1 . . . cm), can we determine a map-               name for a protein, while the second gives the corresponding
ping between subsets of x1 . . . xn and subsets of c1 . . . cm,           Uniprot identifier – for Participant B. (If these columns were
and extract information corresponding to the mapped sub-                  not synonymous, it would be reasonable to hypothesize, sub-
sets? Note first, that we are looking to map subsets onto                 ject to some checks, that the two columns represented Partic-
subsets rather than individual columns because a table can                ipants A and B.) The third column gives site information (S),
spread out information over several columns or compress sev-              and the fourth column gives quantified information (Q).
eral columns into one; and second, that this definition can be
                                                                             This example demonstrates one of the difficulties of auto-
generalized to multiple relational schema: most tables will
                                                                          mated table reading: many desired pieces of information are
map to at most one relational schema, but some tables can
                                                                          not explicit in tables, but must be inferred and/or extracted
map to several schemas.
                                                                          from other sources. In this example, Participant A can be read
   For the particular problem of reading PRI tables, this prob-           from the table title, which gives CSF-1R as the protein acting
lem can be stated in the following domain-specific way: From              on the substrate shown in columns 1 and 2. The table title
a given table, can we extract relation instances of the form              also gives M, the type of modification, in this case (tyrosine)
R(A,B,I,M,S,Q,N) where A and B are Participants A (a pro-                 phosphorylation. This information can also be read from the
tein or chemical) and B (a protein, the substrate) in some                values in the third column: a subset of the letters (S,T,Y) (cor-
interaction; I is the reaction or interaction type itself (e.g.,          responding to three common types of residue in phosphory-
increases or decreases activity); M is the type of modification           lation), or the letter P, in a column that gives site information,
(e.g., phosphorylation or acetylation); S is the site at which            are two common ways of indicating that the modification is
the reaction takes place; Q is quantified information, e.g., the          phosphorylation. N can be deduced from the values in the Q
increase in molecules in the substrate, most often represented            column: in this case, the fold value is always high; thus all
as a multiplier (fold change) or ratio, or the log of the ra-             lines in the table give positive information. One can also in-
tio; and N tells whether or not the information is negative.              fer the reaction type I, in this case increases activity, from the
Other teams on the project, whose systems read text rather                fold change.
than tables, looked for all these fields except for Q, which is
                                                                             An added complexity is that often information about Par-
generally not present in text. Q gives important information
                                                                          ticipant A is bundled with Q. For example, in Figure 2, the
on the magnitude of the modification induced by a reaction;
                                                                          rightmost four columns give quantified information: the sub-
this can often be used to tell if the modification is considered
                                                                          header expresses the fact that these columns give the log of
to be significant. E.g., if the ratio is between .6 and 1.6, the
                                                                          the ratio of the change of molecules in the substrate, com-
reaction is considered by many researchers to have little ef-
                                                                          paring dDAVP (desmopressin) to a control. That is, dDAVP
fect. One can interpret the corresponding line of the table as
                                                                          is Participant A. (dDAVP’s role as Participant A can also be
containing negative information about the interaction; that is,
                                                                          read from the table title, not shown; but this is not gener-
it has not been shown to have a significant outcome.
                                                                          ally the case when it is also coupled with Q information.) Note
   It is rare that all information for all fields is present within a     also that the single column type / argument Q is given over
single table, though often much information can be inferred.              four columns, as a time-series. Of special interest are cases
   The technical challenges we faced are best understood by               (e.g., 2nd line) where dDAVP initially inhibits the reaction,
examining several sample tables. Consider Figure 1, which                 although after a few minutes, it activates the reaction. Should
shows a fragment from one of the simplest of the PRI ta-                  this be considered an activation or an inhibition? Our ap-
bles in the MBL. There are four columns shown in this frag-               proach is to check for the greatest absolute value of change
ment. 1 The first two columns are synonyms – the first gives a            and to assign the reaction type based on that value, but it is
    1
     In the full table, there are six columns. The fifth column con-      clear that extracting the desired information from the table is
tains information similar to the fourth column but for a different cell   not trivial.
line. The sixth column gives information about whether a substrate           Figure 3 shows a table with positive information, negative
is an src-inhibitor. Both provide useful data, but such information is    information, and for some lines of the table, a lack of infor-
out of scope for this paper.                                              mation. Note also that in this table, a single column contains
                                                                      We note that if successful, a rule-based system could be
                                                                   used to automate the finding and labeling of training data;
                                                                   that is, it would enable distant-supervision of learning.
                                                                      The system’s architecture is shown in Figure 5.

                                                                   3.1   Determining relevance
                                                                   To be relevant, a table does not have to contain all information
                                                                   needed for a complete relation instance of R(A,B,I,M,S,Q,N).
                                                                   For PRI, a table should at least contain the following: (i) ev-
                                                                   idence of two non-synonymous proteins, (ii) evidence of a
     Figure 2: Table 1 fragment from PMCID 3277771                 protein reaction (such as a post-translational modification),
                                                                   and (iii) some quantified information concerning the reaction.
both site and kinase information (these must be teased apart in    As discussed, these pieces of information can often be com-
order to populate relation instances) and columns containing       bined to obtain further elements of the relation instance, but
free text may appear in the table.                                 even if that cannot be done, the combination of such evidence
   Figure 4 shows a typical Excel table; such tables are often     makes the likelihood that a table is PRI-relevant quite high.
found in a paper’s supplementary materials. While Excel ta-           Given this smaller set of required information, the diffi-
bles often have titles or other information in an area of the      culty reduces to finding and recognizing (i), (ii), and (iii). We
spreadsheet outside the bounds of the table, or on the tab of      need to look for evidence in (a) columns, (b) column headers,
the spreadsheet, this file has neither. The link in the main       and (c) table titles/captions, as well as (d) possibly text out-
paper pointing to this spreadsheet contains some hints about       side the table. Together, (a) and (b) constitute column
the table’s contents. However, it is extremely difficult even      identification.
for a human to understand this table without reading much
of the paper. E.g., how could a human find any evidence of         3.2   Column identification
Participant A? Note also that there appear to be multiple ex-      Rules for identifying columns are based on domain knowl-
periments, or at least multiple modes of measurement that are      edge. For PRI tables, they include the following:
reported in this table.
                                                                   Proteins
3   Technical Approach, Architecture                               Columns of proteins, which almost always consist of in-
                                                                   stances of Participant B, are generally easy to identify. Typi-
In general, automating the extraction of information from ta-      cally protein columns are labeled with the name of the protein
bles relative to a particular schema requires succeeding at two    database whose nomenclature is being used; even without the
tasks: 1) Automating the identification of relevant tables and     column label, regular expressions can be used to identify en-
2) Automating the correct extraction of information from ta-       tries as proteins. The task can be complicated by cell en-
bles marked as relevant.                                           tries that consist of multiple proteins, or proteins along with
   Our research and experimentation has shown that only a          other entities. Sometimes proteins are given as peptide subse-
small fraction of tables collected from the PMC site are PRI       quences; this often allows combining information about mod-
tables. We estimate that 2-3% of papers filtered against a         ification type, site, and Participant B.
set of search terms corresponding to protein reactions contain
PRI tables, and that 1% of papers on the PMC site are PRI          Modification type
tables. For example, clinical tables or tables reporting on cell   This information is often explicitly in column headers or ta-
functionality would be unlikely to have PRI even if the papers     ble titles. Related terms may be used: e.g., “kinase” indicates
themselves make some mention of protein reactions. Such            that the modification is phosphorylation. The most common
rareness does not negate the usefulness of automating PRI          residues on which phosphorylation occurs, serine, tyrosine,
table reading: when PRI tables are found, they often have          and threonine, are represented by several common sets of ab-
hundreds or even tens of thousands of lines of data that are       breviations; their appearance in a column indicates phospho-
not present in text.                                               rylation. Marking the site at which a modification occurs by
   However, it did suggest that statistical classification ap-     a differently-cased letter indicating the modification (e.g., m
proaches, as used by [Pimplikar and Sarawagi, 2012] in             for methylation, p for phosphorylation) also gives useful in-
querying tables, were unlikely to be useful as an initial strat-   formation.
egy: it would be too difficult to find (even using the help
of Elasticsearch) and label even the relatively small training     Site Information
sets required by simple Support Vector Machines. Instead,          Site information is often combined with information indicat-
we opted to develop a rule-based approach. In any case, we         ing modification, as in the NCBI site column in Figure 1 or
opted to use an approach consistent with that of University        the Phosphosite column in Figure 2. As the Site (putative
of Arizona [Valenzuela-Escarcega et al., 2015], another per-       kinase) column in Figure 3 shows, site information is often
former in Big Mechanism. They have used a rule-based sys-          mixed with other specific information in a single table cell.
tem which has excelled in precision and throughput in ex-          In addition, multiple sites are often furnished within a single
tracting PRI from MBL texts.                                       cell.
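
The column-identification rules of Section 3.2 can be illustrated with a minimal sketch. It is not the system's actual rule set: the accession pattern is the one published in the UniProt help pages, while the site pattern (a residue letter S, T, or Y followed by a position), the 0.8 threshold, and the names fraction_matching and label_column are assumptions introduced only for illustration.

```python
import re

# UniProtKB accession pattern as published in the UniProt help pages.
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

# Assumed site notation: residue letter S/T/Y followed by a position, e.g. "Y723".
PHOSPHO_SITE = re.compile(r"\b[STY][0-9]+\b")


def fraction_matching(cells, pattern, search=False):
    """Fraction of non-empty cells that match (or, with search=True, contain) the pattern."""
    cells = [c.strip() for c in cells if c and c.strip()]
    if not cells:
        return 0.0
    test = pattern.search if search else pattern.match
    return sum(1 for c in cells if test(c)) / len(cells)


def label_column(cells, threshold=0.8):
    """Return 'protein', 'site', or None for a column of cell strings.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    if fraction_matching(cells, UNIPROT_ACCESSION) >= threshold:
        return "protein"
    # Sites are searched for inside the cell, since a cell may hold several
    # sites or a site mixed with other information (e.g. a putative kinase).
    if fraction_matching(cells, PHOSPHO_SITE, search=True) >= threshold:
        return "site"
    return None
```

A column labeled "site" by such a rule also hints that the modification type M is phosphorylation, since S, T, and Y are the residues on which it most commonly occurs.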
                                      Figure 3: Table 1 fragment from PMCID 1459033


                                     Figure 4: Table 3 fragment (Excel), PMCID 3229182


                                                                 umn headers could still be improved. In more recent work,
                                                                 we have been exploring the use of clustering to help identify
                                                                 related column headers.
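
As a rough sketch of how quantified columns might be recognized and then used to infer I and N, the following combines the header cues listed above with two rules from Section 2: the "little effect" band for ratios between .6 and 1.6, and the choice of the greatest absolute change in a time series. The pattern list, the base-2 assumption for log headers, and the function names are hypothetical; the paper does not specify these details.

```python
import math
import re

# Header cues named in the text; the exact pattern list is an assumption.
QUANT_HEADER = re.compile(
    r"\b(?:KO\s*/\s*WT|KD\s*/\s*WT|H\s*/\s*V|ratio|fold|multiplier)\b|^\s*log",
    re.IGNORECASE,
)


def is_quantified_header(header):
    """True if the column header looks like one that labels a Q column."""
    return bool(QUANT_HEADER.search(header))


def interpret_row(values, is_log, base=2.0, low=0.6, high=1.6):
    """Return (interaction, negative) for the Q values of one table row.

    values: one or more measurements; a time series spans several columns.
    is_log: True if the values are already logs of a ratio (base assumed).
    """
    logs = [v if is_log else math.log(v, base) for v in values]
    # Time series: use the measurement with the greatest absolute change.
    peak = max(logs, key=abs)
    ratio = base ** peak
    # Ratios inside the band are treated as negative information (N).
    negative = low <= ratio <= high
    if peak > 0:
        interaction = "increases"
    elif peak < 0:
        interaction = "decreases"
    else:
        interaction = None
    return interaction, negative
```

For a hypothetical row of log ratios such as [-0.4, 0.2, 1.5], in which an initial inhibition is followed by activation, the sketch assigns "increases" because 1.5 has the greatest absolute value.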

                                                                 3.3   Discovering Participant A
                                                                 It is almost never the case that Participant A is included in its
                                                                 own column in a table. This is mostly because of the nature
                                                                 of experiments in this domain: one catalyst is tested on mul-
                                                                 tiple substrates. We use several strategies to find Participant
                                                                 A: 1. Look for mention of a suitable entity in the table title
                                                                 or caption. Assuming, however, that any protein or chemical
                                                                 mentioned in the title must be Participant A leads to lowered
                                                                 precision. 2. Look for mention of a suitable entity after a
               Figure 5: System Architecture                     “log” term in a column header. This strategy is also error
                                                                 prone. What follows the “log” is often a short phrase that
                                                                 describes in some way the nature of the experiment; men-
Quantified Information                                           tion of Participant A is not the only possibility. 3. Scan the
Quantified information is displayed as columns of real num-      paper title, introductory paragraph of the paper, and/or first
bers or real-numbered intervals. It is often difficult to tell   paragraph of the methods and material section for mention
from inspection of the column content that these real num-       of suitable entities. This strategy is also error prone, since
bers describe increases or decreases in molecular activity in    there will typically be multiple proteins mentioned in such
substrates relative to a control. This information must al-      paragraphs.
most always be obtained from the column header itself. What          These strategies are best understood as ways of generat-
makes obtaining this information difficult is that there ap-     ing hypotheses about Participant A; corroborating evidence
pear to be scores of ways of labeling the header of this sort    is supplied if other strategies also lead to the same Participant
of column. Some examples include KO/WT (knock-out /              A. Once a Participant A is hypothesized, the system further
wild-type), indicating that one is testing how knocking out      checks its hypothesis by iterating on the contents of the col-
a gene affects a modification relative to the wild-type con-     umn that has been identified as containing instances of Par-
trol, KD/WT (knockdown/wild type), H/V (hormone / vehi-          ticipant B. It checks the text of the paper for sentences of the
cle), ratio, fold-value, multiplier, and many column headers     form A [modifies] B or B is [modified] by A. Any such in-
beginning with “log.” Recognizing all instances of such col-     stances further raise the confidence metric that Participant A
has been hypothesized correctly.                                     4.2   Evaluation of Question 1
                                                                     Human Gold Standard Development
3.4    Other Information and Inference
                                                                     Two team members with knowledge of molecular biology and
Often the nature of a reaction – e.g., whether it increases or in-   who had not been involved in development of the table read-
hibits activity – is mentioned directly in a table title. Such in-   ing system created a Human Gold Standard for Question 1.
formation can also be inferred from Q, e.g., by noting whether       First, they were shown examples of relevant and irrelevant
a ratio is greater or less than 1.                                   papers from the training corpus. Then, they were given the
   Whether information is negative must also often be in-            test set of “phosphoproteomics” papers, trailing digits 7 and
ferred, especially when time-series information is given,            0. There were 515 papers in this test set and 977 tables. Using
                                                                     Elasticsearch, the Human Gold Standard developers searched
3.5    Extracting information                                        for terms that were similar to those in the examples of rele-
The end goal of our system is extracting information on pro-         vant papers in the training corpus.
tein reactions that can be used by knowledge bases like Path-            The two team members worked separately. Inter-annotator
way Commons. To that end, when a table is determined to be           agreement was high, near 90%. The few tables that they did
relevant, all information corresponding to fields in the target      not agree on were discarded. The size of the Human Gold
schema is extracted into a BioPAX-consistent format [Rod-            Standard was limited by the number of relevant tables that
chenkov et al., 2013]. Proteins are converted to their equiva-       these team members could find (30). They had similar pat-
lent in Uniprot whenever possible.                                   terns of being able to find some relevant tables quickly (the
                                                                     equivalent of low-hanging fruit), and then slowing down un-
                                                                     til they got to the point that it was too frustrating to continue.
4     Evaluation                                                     The Human Gold Standard thus consisted of 30 relevant ta-
4.1    Training and Test Sets                                        bles and 30 irrelevant tables.
We formed a training and test corpus of papers and tables by         Results
entering the search term “phosphoproteomics” into the PMC            For the Question 1 test, the system was given the 60 tables of
search engine at http://www.ncbi.nlm.nih.gov/pmc/ in May             the Human Gold Standard and labeled each table as relevant
2015. The term “phosphoproteomics” was chosen to increase            or irrelevant. It achieved precision of .93 and recall of .5, for
the likelihood that papers returned would include PRI tables.        an F1 score of .65. This is statistically significantly better
Around 3-4% of papers returned contained at least one PRI            than random.
table. Thus, the task of finding PRI-relevant tables was still          The scores were consistent with (though a bit lower than)
very difficult; however, it made it somewhat easier for hu-          earlier testing on the internal test set, trailing digits 8 and 9.
mans to create a gold standard with PRI-relevant tables.             Precision was consistently excellent (usually perfect), while
   Of the more than 2200 PMCIDs (corresponding to papers)            recall hovered between .5 and .6.
that were returned, we used papers with trailing digits 1-6 as
a training set. We reserved papers with trailing digits 8 and 9      Evaluation of Question 2
for internal testing purposes, and reserved papers with trailing     The system ran on the “phosphoproteomics” corpus, trailing
digits 7 and 0 for the test whose results would be reported to       digits 7 and 0 (515 papers, 977 tables) and on the corpus sup-
DARPA. (We consulted with PubMed and PMC to ascertain                plied by the government evaluation team (1000 documents,
that trailing digits of papers are assigned randomly, so that        646 tables). The system labeled 30 tables from the first test
no bias was introduced in thus creating our training and test        set as relevant and extracted more than 12,000 protein reac-
sets.) We did not touch any of the papers in the reserved set,       tions. Note that the system labeled 30 tables as relevant, even
either processing them or manually inspecting them, until the        though of the 30 relevant tables in the Human Gold Standard,
time of the test. (Automated preprocessing to scrape and find        it only labeled 15 as relevant. The reason for this is that the
tables can take several days, so this was done several days          system was able to find relevant tables that the humans were
before the test was run.)                                            not able to find. (We checked to make sure that the system had
   We were also provided with a set of 1000 PMCIDs from              indeed found relevant tables.) Given the small percentage of
MITRE, the evaluation team for Big Mechanism, to be used             relevant tables in any corpus, and the difficulty of searching
as a test set. As with our own test sets, we did not process or      for such tables, even with Elasticsearch, this is not surprising.
inspect these until the time of the test.                            In other words, even given virtually unlimited time, human
   The test aimed to answer two questions:                           recall of (ability to find) relevant tables is no better than sys-
Question 1: How well could the system find relevant tables,          tem recall.
that is, tables with protein reaction information? Specifically,        The system labeled 11 tables from the second test set as
could it be shown that the system performed statistically sig-       relevant and extracted 585 protein reactions.
nificantly better than random?                                          We examined the 11 tables labeled as relevant. 10 were la-
Question 2: How accurately could the system extract infor-           beled correctly, but one of the tables was irrelevant, giving a
mation? That is, given a table, would the system extract in-         precision score of just under .91. To score the precision of the
formation correctly?                                                 information extracted, we followed the rubric that the govern-
   For Question 1, we aimed to compute precision, recall, and        ment evaluation team was using for text-reading systems. An
an F1 score. For Question 2, we focused on precision.                entirely correct schema mapping received 1 full point. Half
a point was deducted for an error (e.g., getting Participant A wrong, missing Participant A
when it existed in the table (even in a caption or column header), getting Participant B
wrong, or getting the modification type wrong). Thus, any more than one error resulted in
a score of 0.
   We examined each table to determine whether the system had correctly mapped the subset
of columns it selected onto the desired relational schema. We randomly selected three rows
of each table for inspection. In all cases, the three rows agreed with one another and
appeared to fairly represent the table. We scored .8 on precision of correct mapping to the
relational schema. Most of our errors resulted from missing Participant A when it was in a
caption or column header, or from getting Participant A wrong. Aside from Participant A, we
achieved a precision of .95.

4.3    Subsequent Progress and Evaluations
Following the evaluation, we revamped the system with multiple objectives:
   1. Making the code more efficient, so that we could run test-and-fix cycles more quickly.
   2. Fixing bugs.
   3. Improving recall by recognizing different types of post-translational modifications,
as well as recognizing unusual choices on the part of table designers. For example, in the
same way that authors often stuffed two types of information into one column (e.g.,
Participant A and quantified information), they also separated into multiple columns
information that might be expected to stay together, such as phosphorylation site and
phosphorylation residue.
   After updating our system, we conducted an expanded search and collected more than 3500
PMCIDs, yielding more than 400 relevant tables from 91 PMCIDs. We extracted more than
120,000 protein interactions from these tables. We then selected 9 tables from new PMCIDs
that we had not previously inspected or processed in prior evaluations, and evaluated
precision. This time, we graded individual rows, and inspected more rows per table to make
sure that we were not overlooking possible errors.
   Our precision was 1.0 for Participant B, .89 for modification type, .93 for site
information, and 1.0 for negative information. We continued to do poorly in recognizing
Participant A, however, and this remains an area of current research.

5     Related Work
Much work on table reading has focused on detecting tables or elements of tables, such as
columns, rows, headers, and stubs [Fang et al., 2012; Hurst, 2000; Wang and Hu, 2002].
While such work is clearly important to the general problem of table reading, it is not
very relevant to our current work for two reasons. First, for the large PMC corpus
(containing millions of papers) on which we focus, we have solved the problem of finding
and extracting the physical elements of tables. Second, we are mainly concerned with the
semantics of the tables, and these papers do not focus much on semantics.
   [Wong, 2008] studies the extraction of information from biomedical tables. However, he
limits himself to extracting named entities. We extract entities but also focus on schema
mapping, relation recognition, and extraction.
   The work of [Pimplikar and Sarawagi, 2012; Cafarella et al., 2008; 2009] shows a
direction that we would like to pursue: using statistical classification techniques to
understand table relations. We are currently exploring several clustering methods for this
purpose.
   The work of [Mulwad et al., 2014] is of particular interest. The authors aim to do
meta-analysis of medical tables on the web. Central to their approach is a mapping of the
categories that are found (e.g., in table metadata) to well-known ontologies such as
DBPedia and SNOMED. We plan to model future research on this work, but note two ways in
which the authors’ work is different from ours. Most saliently, in order to interpret PRI
tables, the system is not primarily concerned with medical ontologies such as SNOMED;
rather, it needs to process concepts of experimentation and measurement, as well as
detailed concepts in molecular biology. Mulwad et al.’s work seems most suited to
meta-analysis of clinical papers. Second, we focus on extraction, something that is absent
in Mulwad et al.’s work. Despite these differences, we would be interested in exploring
connections between these approaches.

6   Current and Future Work; Generalization to Other Domains
We are improving on and extending this work in several directions. First, we are working on
improving recognition of Participant A by integrating this work with biomedical NLP systems
so that we can read more of the text that is relevant to Participant A. Merging text and
table-reading systems could have other advantages, including allowing cross-checking
between different systems on closely related data.
   Second, we are running the automated table reading system on larger corpora. In recent
work, we have run the table reading system on 13K papers, yielding several dozen tables and
42K complete protein reactions. This not only shows that our system is more than a toy
system, but also gives us much data to analyze so that we can improve performance. The
extracted information has the potential to significantly increase the size of the Pathway
Commons KB.
   Third, we are working on generalizing the system to work on different domains:
(1) Tables containing other information about proteins, including interactions between
proteins and biological processes and molecular functions; and expression of proteins in
tumor samples.
(2) Tables containing information about climate conditions and crop yields, to populate
crop-forecasting models.
(3) Tables containing information relevant to weapons development, to be used by
intelligence analysts. Much of the data that analysts use is in tables. The amount of data
is much too large for analysts to read; tools that would allow them to query and retrieve
answers would help these analysts work more efficiently.

7   Acknowledgements
We gratefully acknowledge helpful comments and advice from Ron Ferguson, Ron Keesing, Ryan
Murphy, Tifani O’Brien, Ibrahim Shafi, Mark Williams, Mark Clark, Ernie
Davis, Emek Demir, Danny Powell, and Ted Senator. This work was supported by DARPA under
contract W911NF-14-C-0119.

References
[Cafarella et al., 2008] Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wang, Eugene Wu,
   and Yang Zhang. Webtables: exploring the power of tables on the web. PVLDB,
   1(1):538–549, 2008.
[Cafarella et al., 2009] Michael J. Cafarella, Alon Y. Halevy, and Nodira Khoussainova.
   Data integration for the relational web. PVLDB, 2(1):1090–1101, 2009.
[Cerami et al., 2011] Ethan G. Cerami, Benjamin E. Gross, Emek Demir, Igor Rodchenkov,
   Özgün Babur, Nadia Anwar, Nikolaus Schultz, Gary D. Bader, and Chris Sander. Pathway
   commons, a web resource for biological pathway data. Nucleic Acids Research,
   39(Database-Issue):685–690, 2011.
[Croft et al., 2014] David Croft, Antonio Fabregat Mundo, Robin Haw, Marija Milacic, Joel
   Weiser, Guanming Wu, Michael Caudy, Phani Garapati, Marc Gillespie, Maulik R. Kamdar,
   Bijay Jassal, Steven Jupe, Lisa Matthews, Bruce May, Stanislav Palatnik, Karen Rothfels,
   Veronica Shamovsky, Heeyeon Song, Mark Williams, Ewan Birney, Henning Hermjakob, Lincoln
   Stein, and Peter D’Eustachio. The reactome pathway knowledgebase. Nucleic Acids
   Research, 42(Database-Issue):472–477, 2014.
[Fang et al., 2012] Jing Fang, Prasenjit Mitra, Zhi Tang, and C. Lee Giles. Table header
   detection and classification. In Proceedings of the Twenty-Sixth AAAI Conference on
   Artificial Intelligence, July 22-26, 2012, Toronto, Ontario, Canada, 2012.
[Hurst, 2000] Matthew Francis Hurst. The interpretation of tables in text, 2000. Ph.D.
   thesis, University of Edinburgh.
[Mulwad et al., 2014] Varish Mulwad, Tim Finin, and Anupam Joshi. Interpreting medical
   tables as linked data for generating meta-analysis reports. In Proceedings of the 15th
   IEEE International Conference on Information Reuse and Integration, IRI 2014, Redwood
   City, CA, USA, August 13-15, 2014, pages 677–686, 2014.
[Pimplikar and Sarawagi, 2012] Rakesh Pimplikar and Sunita Sarawagi. Answering table
   queries on the web using column keywords. PVLDB, 5(10):908–919, 2012.
[Rodchenkov et al., 2013] Igor Rodchenkov, Emek Demir, Chris Sander, and Gary D. Bader. The
   biopax validator. Bioinformatics, 29(20):2659–2660, 2013.
[Valenzuela-Escarcega et al., 2015] Marco Antonio Valenzuela-Escarcega, Gustave
   Hahn-Powell, Mihai Surdeanu, and Thomas Hicks. A domain-independent rule-based framework
   for event extraction. In Proceedings of the 53rd Annual Meeting of the Association for
   Computational Linguistics and the 7th International Joint Conference on Natural Language
   Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31,
   2015, Beijing, China, System Demonstrations, pages 127–132, 2015.
[Wang and Hu, 2002] Yalin Wang and Jianying Hu. Detecting tables in HTML documents. In
   Document Analysis Systems V, 5th International Workshop, DAS 2002, Princeton, NJ, USA,
   August 19-21, 2002, Proceedings, pages 249–260, 2002.
[Wong, 2008] Wli Wong. Extracting named entities from tables in biomedical literature,
   2008. Honour’s thesis, University of Melbourne.

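Illustrative sketch of the scoring rubric. The rubric used in the evaluation (1 full point
for an entirely correct schema mapping, half a point deducted per error, and a score of 0
for more than one error) can be written down in a few lines of Python. This is only a
sketch under stated assumptions, not the evaluation team's code: the class Extraction, its
three boolean fields, and the function names are hypothetical, and the three checked fields
are just the example error types the rubric text mentions. The final call recomputes the
table-relevance precision from the reported counts (10 correct out of 11 labeled tables).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Extraction:
    """One graded extraction; field names are hypothetical, chosen for illustration."""
    participant_a_ok: bool
    participant_b_ok: bool
    modification_ok: bool


def rubric_score(row: Extraction) -> float:
    """Start at 1 point, deduct 0.5 per error, and never go below 0."""
    checks = [row.participant_a_ok, row.participant_b_ok, row.modification_ok]
    errors = checks.count(False)
    return max(0.0, 1.0 - 0.5 * errors)


def mean_score(rows: Sequence[Extraction]) -> float:
    """Average rubric score over a set of graded extractions."""
    return sum(rubric_score(r) for r in rows) / len(rows)


def precision(correct: int, labeled: int) -> float:
    """Fraction of items labeled relevant that really are relevant."""
    return correct / labeled


if __name__ == "__main__":
    rows = [
        Extraction(True, True, True),    # no errors
        Extraction(False, True, True),   # one error (Participant A)
        Extraction(False, False, True),  # two errors
    ]
    print([rubric_score(r) for r in rows])  # [1.0, 0.5, 0.0]
    print(mean_score(rows))                 # 0.5
    print(round(precision(10, 11), 3))      # 0.909, i.e. just under .91
```

Flooring the score at 0 is an assumption that matches the statement that any more than one
error yields a score of 0; with exactly two errors the deduction alone already gives 0.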
</pre>