
1613-0073

for Tables in Scientific Literature

Anupam Joshi

joshi@umbc.edu 2 3 4

Varish Mulwad

varish.mulwad@ge.com 1 2 4

Tim Finin

finin@umbc.edu 2 3 4

Vijay S. Kumar

v.kumar@ge.com 0 2 4

Jenny Weisenberg Williams

0 2 4

Sharad Dixit

sharad.dixit@ge.com 0 2 4

0 GE Research, 1 Research Circle, Niskayuna, NY, USA
1 GE Research, John F. Welch Technology Center, Whitefield, Bengaluru, India
2 Third AAAI Workshop on Scientific Document Understanding, 2023
3 University of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore, MD, USA
4 Workshop Proceedings

2020

Entity linking is an essential step towards constructing knowledge graphs that facilitate advanced question answering over scientific documents, including the retrieval of relevant information present in tables within these documents. This paper introduces a general-purpose system for linking entities to items in the Wikidata knowledge base. It describes how we adapt this system for linking domain-specific entities, especially those embedded within tables drawn from COVID-19-related scientific literature. We describe the setup of an efficient offline instance of the system that makes our entity-linking approach more feasible in practice. As part of a broader approach to infer the semantic meaning of scientific tables, we leverage the structural and semantic characteristics of the tables to improve overall entity linking performance.

entity linking, knowledge graph, tables, scientific documents

CEUR ceur-ws.org

1. Introduction

The rapid pace of research in dynamic, fast-evolving scenarios, as recently exemplified by COVID-19 and the ject [1], has necessitated more machine-driven, human-interpretable approaches to scientific knowledge discovery. Open datasets like CORD-19 [2] have motivated novel techniques and tools for keyword/semantic search and Q&A, recommendation, and summarization of scientific documents. As with the web, discovery from scientific literature is predominantly associated with searching over unstructured textual content. Domain-specific neural search engines [3, 4] typically produce ranked lists of matching articles in response to search requests, while mainstream information retrieval methods may also deliver direct short, targeted responses (drawn from text) to queries. To facilitate such a search, Sohrab et al. [5] introduced the BENNERD system and an annotated subset of CORD-19 articles to demonstrate the fundamental tasks of named entity recognition and entity linking for COVID-19-related entities found in the text.

Besides text, alternative modalities such as tables and charts have come to play a considerable role in how the scientific community succinctly conveys descriptive information in the literature. Our experience assembling a corpus of over 62,000 open-access coronavirus-related articles yielded over 120,000 tables, underlining a wealth of latent knowledge embedded within these structured artifacts. The extraction and retrieval of relevant information from these scientific tables is becoming increasingly critical to emerging knowledge-driven applications. For example, consider a genomic surveillance scenario seeking information on treatment efficacies against the top prevalent COVID-19 variants in each US state. Better responses to such queries entail going beyond text and searching relevant portions of or entire scientific tables for vital knowledge nuggets, possibly fusing information from multiple source tables on the fly.

Although learning-based representational models for tabular data [7] show great promise for understanding relationally structured web tables, these models are typically not tuned to unconventional structural complexity. This is especially true for the dense and often implicit semantics and diffuse context inherent in scientific tables in highly specialized domains [8]. Representing scientific tables as semantically annotated linked data artifacts accounts for structural complexities and enables explicit reasoning over tabular content to infer their semantics and relevance to search queries. Hence, entity linking is fundamental to our end-to-end pipeline for constructing such knowledge graphs of tables drawn from scientific documents, as depicted in Figure 1.

This paper presents an entity linking system to automatically map the content of individual cells in scientific tables to appropriate entries in the Wikidata knowledge base [9]. To keep up with the scientific literature infodemic, we architected a more efficient local, offline linking system using periodic Wikidata knowledge dumps. While the ensuing efficiency gains make our system more feasible in practice, we discuss the implications for linking performance.

∗Corresponding author. (A. Joshi)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

2. Entity Linking for Scientific Text and Tables

Given a mention of an entity in a document and a unique set of known entities defined in some knowledge base, entity linking refers to finding and assigning the entity ID corresponding to the mentioned entity. Entities play an essential role in text and are often used to describe what the text is about. Likewise, linking entity mentions in the header and body cells of tables, as well as linking entities in captions or other referring text, can help partly understand or infer the semantic meaning of tables. We developed a general-purpose linker to link entity mentions in text to items in (and to further extract useful information about items from) Wikidata. We describe the linker’s customization and inner workings for linking highly specialized, idiomatic content within header and body cells of tables drawn from a corpus of COVID-19-related scientific literature.

2.1. Wikidata: Reference Knowledge Base

Wikidata [9] is a collaboratively edited multilingual knowledge graph used to provide common data for Wikimedia projects, with currently about 1.2 billion facts on over 102 million items. Wikidata’s ontology has a fine-grained type system with more than two million types and about 11 thousand properties, including an item’s label, aliases, and description. Each Wikidata item has a unique identifier beginning with Q, like Q3519875 (“National Institute of Allergy and Infectious Diseases”), and each property has an identifier starting with P. The property P31 (instance of) links an item with its immediate types, P279 (subclass of) links a concept item to its immediate supertypes, and P1647 (subproperty of) links properties to their immediate super-properties.

An entity has just one label in a given language, its “canonical name”. An entity can have any number of aliases in a language and can have a short description in any language. Unlike other open knowledge graphs, Wikidata includes and links to specialized knowledge from additional domain-specific knowledge resources. These include the Unified Medical Language System (UMLS) [10] knowledge base and the Medical Subject Headings (MeSH) thesaurus [11], which bring together biomedical vocabularies and standards to enable interoperability.

Figure 2 shows an example of a simple scientific table with links to appropriate Wikidata items, highlighting several high-level issues we addressed. One is that we must consider the “header” cells (whether for columns or rows) differently from the regular table body cells. Note that the third column’s header cell, Prevalence, has two good candidate links: the concept Q719602 (“number of disease cases in a given population at a specific time”) and the property P1193 (“portion in percent of a population with a given disease or disorder”). In such cases, we give preference to the property item over the concept item.

The middle header cell containing the text Lineage illustrates a second issue: A simple linker might choose the most common match for this based only on the text, Q1517820 (“line of ancestors and descendants of a person”). However, the cells in this column (e.g., B.1.1.7) are all easily matched to Wikidata items whose immediate type is Q104450895 (“variant of SARS-CoV-2”). Therefore, we need to do joint inference using both the header cell and a sample of its data cells to choose the best links for both.

The first column of the table highlights a third aspect of the task: mining additional knowledge from resources connected to candidate Wikidata items. Wikidata items often link to other knowledge graphs, such as DBpedia [12], that contain additional useful information. DBpedia, for example, has a short paragraph describing its items and links to types in the Yago fine-grained type system [13].

2.2. Core Entity Linking Algorithm

Our entity linker takes a mention string (e.g., from a table header or cell) and begins by retrieving a pre-specified number of Wikidata items using the MediaWiki search API. This returns a ranked list containing each item’s Wikidata ID, label, aliases, and English language description. Next, we rerank candidates to promote ones for which the mention string exactly matches a Wikidata item’s label (best) or alias (second best). For each candidate, we use a SPARQL query to retrieve its types, both immediate (P31) and inherited, via a chain of P279 links for concept super-classes and P1647 links for property super-properties.

For specific domains, our linker leverages the ultra-fine-grained Wikidata type system to infer additional domain types for an item by checking for specific domain-relevant properties. We identified a custom set of Wikidata item types and properties to support entity linking for the biomedical domain. For example, we infer the mesh item type if an item has a MeSH descriptor ID property (P486) that connects the item with a UMLS Medical Subject Heading.

When linking the text in a header cell, we give more weight to candidates that are Wikidata properties. For example, candidates for the text “location” include an item representing the geographic location (Q2221906) as well as the property location (P276). While either might be relevant, our annotation methodology strongly preferred the latter.

The linker’s filtering and ranking of candidate items are based initially on analyzing an item’s types. This type of analysis is controlled by five lists of types that are part of the linker’s configuration for a domain and task. These are ordered from best to worst as follows: (1) Target types are those we want to find based on the mention type identified by an NLP system; (2) Near-miss types are close to the target types and often confused with the targets by an NLP system; (3) Good types are ones that are very relevant to the domain, such as a MeSH term (Medical Subject Heading); (4) OK types include types that are acceptable and common in many domains, such as organizations, people, geo-political entities, and locations; and (5) Bad types are ones we are not interested in (e.g., fictional characters, journal articles, musical groups) and result in a candidate being immediately rejected.

The type names of interest are mapped to Wikidata types via the linker’s configuration dictionary. Extending this dictionary enabled us to easily customize our linker to specific domains, such as COVID-19-related scientific research. For our domain, examples of good types are Wikidata high-level classes corresponding to disease, protein, chemical compound, vaccine type, and type of statistic. OK types are those associated with the standard OntoNotes [14] types, such as person, event, facility, organization, and location. Entities of these types often occur in biomedical tables. Our bad types cover things like songs, works of art, sports organizations, fictional things, and other high-level types unlikely to be present in medical tables. For example, there exist 83 Wikidata items with the canonical name “virus”. These include Q808, the infectious agent, as well as films, songs, musical albums, rock groups, paintings, video games, musicians, professional wrestlers, and more.

Finally, we have a mapping of near-miss types that represent types that are easily confused. A classic example is the OntoNotes types FAC (for facility) and LOC (for location), which are easily confused by most NLP systems. An entity like Wuhan Institute of Virology can be marked as an ORG, LOC, or FAC, depending on its context. Since locations are a common type in tables for this domain, we can treat an item identified as a FAC or ORG by a language processor as possibly referring to a location. Additional ranking for an item’s prominence is then done using its number of sitelinks, i.e., the number of links to other Wikimedia projects that contain information about the item.

Beyond type analysis-based filtering, the last step is the ranking of the final candidates using a context span or string, if provided. The similarity of the context and the item’s description is computed with embeddings from the spaCy [15] large language model and generates a score that is used along with the item’s rank in the candidate list to select and return the best link. This worked reasonably well for both well-structured text (e.g., table captions) and for collections of terms from the row and column headers and could be improved by using an embedding model fine-tuned on the biomedical domain.

3. Efficient Entity Linking at Large Scale

Our entity linker initially used the Wikidata and Wikimedia APIs to retrieve the initial ranked list of Wikidata candidate items and their type and supertype information. Since Wikidata is a public resource, the APIs are understandably rate-limited such that unreasonable access requests and query rates in excess of established limits may lead to IP address blacklisting [16]. The table in Figure 3 breaks down our average observed entity linking time to link a single exemplar mention string to a Wikidata entity while operating under the above limits. Accessing public Wikidata APIs, our linker can operate no faster than around 30 seconds per entity. For our dataset of 120,000+ tables (a rate reflective of the COVID-19 infodemic), annotating even just 10 cells per table at this rate could end up taking over a year.

Furthermore, when applying entity linking to infer table semantics (see next section), the linking of a single header cell could, in turn, translate to the linking of all other cells in the respective column or row, potentially placing far greater stress on the linker. As a result, while Wikidata APIs facilitated a proof of concept of our core entity linking algorithm, they cannot sustain a practical, scalable linking service capable of keeping up with contemporary scientific publication rates.

To address these API rate-limit bottlenecks, we initially set up a transient caching layer for cell entity linking results so that future requests to link the same mention string would be served from the cache, avoiding API invocations. However, this strategy was insufficient, so we decoupled our core entity linker from the public Wikidata altogether by architecting and progressively setting up a more efficient system using local periodic dumps of relevant Wikidata knowledge.

The system is offline because the linker no longer relies on Wikidata APIs. Wikidata’s complex software architecture [17] and its enormous size make it challenging to replicate locally in its entirety. That said, our entity linker does not need all the capabilities that Wikidata offers. We targeted emulation strategies addressing bottlenecks with cross-item graph search (via the Wikidata query service (WDQS) and Wikidata’s underlying RDF triple store) and full-text search over items and their properties (via the Action API and underlying CirrusSearch Wikibase extension). We leverage proven open-source storage technologies such as the Elasticsearch engine and the Redis key-value store to emulate underlying Wikidata capabilities, as depicted in Figure 4.

We implemented this system by uploading partial JSON dumps of Wikidata items, their basic attributes (label, aliases, description), specific types, and ‘sitelinks’ counts¹ into a local Elasticsearch index. This resulted in a locally searchable collection of 95.8M items. Offline, we retrieved the current type hierarchy (by traversing P31 and P279 property relationships) and loaded the resulting dictionary, which maps each of Wikidata’s 2.6M types to its supertypes, into Redis. This reduced determining if an entity was an instance of a given type (direct or inherited) to a dictionary lookup.
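The reduction of type checking to a dictionary lookup can be illustrated with a minimal sketch. It is not our production code: a plain Python dict stands in for the Redis store, and the type names and the helper functions `build_supertype_index` and `is_instance_of` are hypothetical placeholders introduced only for this example.

```python
from collections import deque

# Hypothetical miniature of "subclass of" (P279) edges; a real index would be
# built from a Wikidata dump and hold millions of types.
SUBCLASS_OF = {
    "variant_of_sars_cov_2": ["virus_variant"],
    "virus_variant": ["biological_variant"],
    "biological_variant": [],
}


def build_supertype_index(subclass_of):
    """Precompute, for every type, the set of all its transitive supertypes."""
    index = {}
    for start in subclass_of:
        seen = set()
        queue = deque(subclass_of.get(start, []))
        while queue:
            parent = queue.popleft()
            if parent in seen:  # guards against cycles in the hierarchy
                continue
            seen.add(parent)
            queue.extend(subclass_of.get(parent, []))
        index[start] = seen
    return index


def is_instance_of(direct_types, target_type, supertype_index):
    """True if any direct (P31) type is the target or inherits from it."""
    return any(
        t == target_type or target_type in supertype_index.get(t, set())
        for t in direct_types
    )


# The dict plays the role of the key-value store in this sketch.
supertype_index = build_supertype_index(SUBCLASS_OF)
print(is_instance_of(["variant_of_sars_cov_2"], "biological_variant", supertype_index))
```

Precomputing the transitive closure moves the graph traversal to load time, so each check at linking time costs only a few set-membership tests.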

In this efficient entity linking system, an initial candidate search is performed using an Elasticsearch multi-match query that compares a mention string against labels and aliases. In lieu of Wikidata’s CirrusSearch ranking mechanisms, we use an item’s sitelinks count (i.e., popularity) as a proxy for its prominence and rank candidates in descending order of their sitelinks counts. Once we have a ranked list of candidates for each mention, we query Redis using each candidate item’s entity ID and direct types as keys to retrieve the associated inherited types. Type analysis and re-ranking then proceed as before.

¹ A Wikidata item’s sitelinks property is the number of other Wikimedia sites, such as Wikipedia, Wikisource, and Wikivoyage, in which it appears. It is commonly used as a metric for the item’s importance.
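The candidate search and ranking steps can be sketched as follows. This is an illustrative approximation under stated assumptions, not the system's actual code: the index field names `label` and `aliases`, the candidate records, their sitelinks counts, the case-insensitive comparison, and the helper names are all hypothetical. Only the `multi_match` query structure is standard Elasticsearch query DSL.

```python
def multi_match_body(mention, size=10):
    """Build an Elasticsearch query body matching a mention against labels and aliases.

    The field names are assumptions about how the local index is laid out.
    """
    return {
        "size": size,
        "query": {"multi_match": {"query": mention, "fields": ["label", "aliases"]}},
    }


def match_tier(mention, candidate):
    """0 = exact label match (best), 1 = exact alias match, 2 = anything else."""
    m = mention.casefold()  # case-insensitive comparison is our assumption
    if candidate["label"].casefold() == m:
        return 0
    if any(alias.casefold() == m for alias in candidate.get("aliases", [])):
        return 1
    return 2


def rank_candidates(mention, candidates, bad_types=frozenset()):
    """Reject candidates with a bad type, then order by match tier and sitelinks."""
    kept = [c for c in candidates if not set(bad_types) & set(c.get("types", []))]
    return sorted(kept, key=lambda c: (match_tier(mention, c), -c.get("sitelinks", 0)))


# Hypothetical candidates for the mention "virus"; all values are illustrative.
candidates = [
    {"id": "item-film", "label": "Virus", "aliases": [], "types": ["film"], "sitelinks": 40},
    {"id": "item-agent", "label": "virus", "aliases": ["viruses"], "types": ["infectious_agent"], "sitelinks": 200},
    {"id": "item-alias", "label": "computer virus", "aliases": ["virus"], "types": ["software"], "sitelinks": 90},
    {"id": "item-partial", "label": "virus classification", "aliases": [], "types": ["concept"], "sitelinks": 300},
]
ranked = rank_candidates("virus", candidates, bad_types={"film"})
print([c["id"] for c in ranked])  # ['item-agent', 'item-alias', 'item-partial']
```

In this sketch the sitelinks count only breaks ties within a match tier, so a very popular partial match cannot outrank an exact label match.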

Figure 5 shows a progression in replacing Wikidata API invocations with queries to these local knowledge stores. The resulting system trades linking accuracy for a threefold improvement in linking efficiency, with the potential for even further speedups via parallel processing. The impact on entity linking performance is largely dictated by the quality of the initial ranked candidate list returned by our Elasticsearch query. We are exploring techniques like PageRank to better estimate an item’s relative importance.
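To put the reported speedup in perspective, the following back-of-the-envelope sketch recomputes the corpus annotation time. It assumes strictly sequential linking, the roughly 30 seconds per entity and 10 cells per table quoted earlier, and, as a simplifying assumption of ours, that the threefold improvement applies uniformly to the per-entity time. The function name is illustrative.

```python
SECONDS_PER_DAY = 86_400

NUM_TABLES = 120_000        # approximate corpus size reported in the text
CELLS_PER_TABLE = 10        # illustrative number of annotated cells per table
API_SECONDS_PER_ENTITY = 30  # approximate per-entity time via the public APIs


def linking_days(tables, cells_per_table, seconds_per_entity):
    """Days needed to link every cell one after another at a fixed per-entity cost."""
    return tables * cells_per_table * seconds_per_entity / SECONDS_PER_DAY


api_days = linking_days(NUM_TABLES, CELLS_PER_TABLE, API_SECONDS_PER_ENTITY)
# Assumption: a threefold efficiency gain means one third of the per-entity time.
offline_days = linking_days(NUM_TABLES, CELLS_PER_TABLE, API_SECONDS_PER_ENTITY / 3)

print(f"public APIs: {api_days:.0f} days")      # public APIs: 417 days
print(f"offline (3x): {offline_days:.0f} days")  # offline (3x): 139 days
```

Even at the improved rate, sequential annotation remains on the order of months, which is why the potential for parallel processing matters.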

4. Entity Linking to Infer Semantics of Tables

The meaning of text derives from its constituent words, which in turn are understood using grammatical knowledge and context provided by surrounding text. Inferring the intended meaning of tables additionally requires interpreting row/column headers and relations between them, besides linking cell values to entities. To improve entity linking performance for inferring the semantics of scientific tables, we supplement our core algorithm with other techniques (beyond the scope of this paper), as shown in Figure 1. These include:

• Rule-based syntactic characterization: We categorize tables into types (e.g., horizontal) based on their structure,
• Joint inference based on embeddings of Wikidata items: We use Wembedder-driven [18] clustering operations to compute compatibility between entities and to jointly assign entities to cells in a column, and
• Specialists: We use pattern-based or machine-learning approaches to independently assess commonly encoded data types in table cells to avoid linking those cell values that are deemed to be specific kinds of literals (e.g., RNA/DNA sequences or Clinical Trial IDs).

Our entity linking system achieves a fair degree of accuracy in linking table cells to Wikidata items. We based our evaluations on a manually annotated subset of 47 tables extracted from 45 COVID-19-related articles drawn randomly from PubMed Central [6]. Of the 910 table cells (out of a total of 3600 manually annotated cells in these tables) expected to be mapped to a Wikidata item, our linker achieved a recall of 0.82 when the expected annotation was part of the linker’s initial candidate item set, and a precision of 0.51 over the subset of these cells with expected Wikidata annotations.

5. Discussion and Conclusions

Existing NLP tools for entity linking like spaCy [15] support a very limited entity type system, often based on just OntoNotes 5.0 types (e.g., PER, ORG, LOC, FAC) and do not cover specialized scientific entities. The SemTab challenge on Tabular Data to Knowledge Graph Matching focuses on three mapping tasks aimed at inferring the semantics of web tables [19]. While it recently included tables from biology literature, leading tabular entity linking systems [20] do not adequately cover domain-specific entities. Bespoke entity linking systems for COVID-19-related entities [5] link against UMLS and do not exploit the extensive type hierarchy or entity coverage of Wikidata.

Part of our goal is to fill this gap with a practical entity linking system that can not only be adapted for domain-specific entities but can also help infer table semantics with high accuracy by leveraging Wikidata’s rich type system. As entity linking of tables against Wikidata at large scale is bottlenecked by rate-limited APIs [21], we built an offline version of our linking system, achieving a threefold improvement in efficiency at the cost of a tolerable reduction in linking performance.

Acknowledgments

This research is based on work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via [2021-21022600004]. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government.

References

[1] H. Else, Covid in papers: a torrent of science, Nature (2020) 553–553.
[2] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Burdick, D. Eide, K. Funk, Y. Katsis, R. M. Kinney, et al., Cord-19: The covid-19 open research dataset, in: Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, 2020.
[3] E. Zhang, N. Gupta, R. Nogueira, K. Cho, J. Lin, Rapidly deploying a neural search engine for the covid-19 open research dataset: Preliminary thoughts and lessons learned, in: ACL 2020 Workshop on Natural Language Processing for COVID-19 (NLP-COVID), 2020.
[4] K. Hall, An nlu-powered tool to explore covid-19 scientific literature, https://ai.googleblog.com/2020/05/an-nlu-powered-tool-to-explore-covid-19.html, 2020. Accessed: 2022-11-02.
[5] M. G. Sohrab, K. Duong, M. Miwa, G. Topić, I. Masami, T. Hiroya, BENNERD: A neural named entity linking system for COVID-19, in: Q. Liu, D. Schlangen (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 182–188. URL: https://aclanthology.org/2020.emnlp-demos.24. doi:10.18653/v1/2020.emnlp-demos.24.
[6] National Library of Medicine, PMC open access subset, https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/, 2022. Accessed: 2022-11-02.
[7] P. Yin, G. Neubig, W.-t. Yih, S. Riedel, Tabert: Pretraining for joint understanding of textual and tabular data, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 8413–8426.
[8] V. Mulwad, T. Finin, A. Joshi, Interpreting medical tables as linked data for generating meta-analysis reports, in: Proceedings of the 2014 IEEE 15th International Conference on Information Reuse and Integration (IEEE IRI 2014), 2014, pp. 677–686. doi:10.1109/IRI.2014.7051955.
[9] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledgebase, Communications of the ACM 57 (2014) 78–85.
[10] O. Bodenreider, The unified medical language system (umls): integrating biomedical terminology, Nucleic acids research 32 (2004) D267–D270.
[11] C. E. Lipscomb, Medical subject headings (mesh), Bulletin of the Medical Library Association 88 (2000) 265.
[12] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, S. Hellmann, Dbpedia-a crystallization point for the web of data, Journal of web semantics 7 (2009) 154–165.
[13] F. M. Suchanek, G. Kasneci, G. Weikum, Yago: a core of semantic knowledge, in: Proceedings of the 16th international conference on World Wide Web, 2007, pp. 697–706.
[14] R. Weischedel, S. Pradhan, L. Ramshaw, J. Kaufman, M. Franchini, M. El-Bachouti, N. Xue, M. Palmer, J. D. Hwang, C. Bonial, Ontonotes release 5.0, 2013. doi:10.35111/xmhb-2b84.
[15] M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, et al., spacy: Industrial-strength natural language processing in python (2020).
[16] Wikidata, Wikidata query service user manual, https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual, 2022. Accessed: 2022-11-02.
[17] Wikidata, Wikidata architecture, https://upload.wikimedia.org/wikipedia/commons/2/2e/Wikidata_Architecture_Overview_-_High_Level.svg, 2018. Accessed: 2022-11-02.
[18] F. Å. Nielsen, Wembedder: Wikidata entity embedding web service, preprint arXiv:1710.04099 (2017).
[19] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, K. Srinivas, Semtab 2019: Resources to benchmark tabular data to knowledge graph matching systems, in: Proceedings of the 17th International Conference European Semantic Web Conference, Springer, 2020, pp. 514–530.
[20] Y. Chabot, T. Labbé, J. Liu, R. Troncy, Dagobah: An end-to-end context-free tabular data semantic annotation system, in: The 18th International Semantic Web Conference, 2019, pp. 41–48.
[21] P. Nguyen, H. Takeda, Wikidata-lite for knowledge extraction and exploration, in: 2022 IEEE International Conference on Big Data (Big Data), IEEE, 2022, pp. 3684–3686.