
Deeper Clinical Document Understanding Using Relation Extraction

Hasham Ul Haq

hasham@johnsnowlabs.com

Veysel Kocaman

veysel@johnsnowlabs.com

David Talby

david@johnsnowlabs.com

John Snow Labs Inc., 16192 Coastal Highway, Lewes, DE 19958, USA

Keywords: Relation Extraction, Natural Language Understanding, Natural Language Processing, BERT, Spark, Deep Learning, Adverse

Figure 1: A Relation Extraction model semantically relating

2022

The surging amount of biomedical literature and digital clinical records presents a growing need for text mining techniques that can not only identify but also semantically relate entities in unstructured data. In this paper we propose a text mining framework comprising Named Entity Recognition (NER) and Relation Extraction (RE) models, which expands on previous work in three main ways. First, we introduce two new RE model architectures: an accuracy-optimized one based on BioBERT and a speed-optimized one utilizing crafted features over a Fully Connected Neural Network (FCNN). Second, we evaluate both models on public benchmark datasets and obtain new state-of-the-art F1 scores on the 2012 i2b2 Clinical Temporal Relations challenge (F1 of 73.6, +1.2% over the previous SOTA), the 2010 i2b2 Clinical Relations challenge (F1 of 69.1, +1.2%), the 2019 Phenotype-Gene Relations dataset (F1 of 87.9, +8.5%), the 2012 Adverse Drug Events Drug-Reaction dataset (F1 of 90.0, +6.3%), and the 2018 n2c2 Posology Relations dataset (F1 of 96.7, +0.6%). Third, we show two practical applications of this framework: building a biomedical knowledge graph and improving the accuracy of mapping entities to clinical codes. The system is built using the Spark NLP library, which provides a production-grade, natively scalable, hardware-optimized, trainable and tunable NLP framework.

1. Introduction

Biomedical literature has witnessed an exponential rise in the past decade. MEDLINE has indexed more than 5 million records in the past seven years alone [1]. Furthermore, public databases like https://clinicaltrials.gov have seen an explosion of trial data in the aftermath of the novel Covid-19 outbreak.

In addition, the widespread adoption of Electronic Health Records (EHRs) has made copious amounts of free-text data available in digital format. This unstructured data is usually documented by healthcare professionals during the course of patient care, in the form of clinical notes, discharge summaries, lab reports, and pathology reports [2]. While publications and literature are growing rapidly, there is still a lack of structured knowledge that can be easily processed by computer programs. Relation Extraction becomes even more pertinent in biomedical research as it can provide the critical links required to generate knowledge graphs for better analysis and research, and even text summarization. Relating entities also helps improve medical coding by enriching vanilla entity chunks with additional information from the surrounding context.

While the trend of training large transformer models continues, applying them to large datasets remains a challenge as they require significant computational resources. Furthermore, long documents containing a high number of entity spans sharply increase the number of candidate entity pairs for RE classification (the count grows quadratically with the number of entities), requiring significantly more resources and processing time.
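To illustrate how quickly the number of candidate entity pairs grows with document length, here is a minimal sketch that enumerates every unordered pair of entity chunks. The function names are hypothetical and are not part of the framework described in this paper.

```python
from itertools import combinations


def candidate_pairs(entities):
    """Return every unordered pair of distinct entity chunks."""
    return list(combinations(entities, 2))


def pair_count(n):
    """Closed form for the number of unordered pairs among n entities."""
    return n * (n - 1) // 2


# Number of candidate pairs for documents with 10, 100 and 1000 entities.
growth = {n: pair_count(n) for n in (10, 100, 1000)}
print(growth)  # {10: 45, 100: 4950, 1000: 499500}
```

A hundredfold increase in entities yields roughly a ten-thousandfold increase in pairs to classify, which is why pruning candidate pairs matters for long documents.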

In this study we focus on three major aspects of RE: the model architectures and their scalability, evaluating the models on benchmark datasets, and training and using RE for general use-cases. We also study the application of RE for understanding different aspects of clinical documents, like extracting and relating dates to place a patient’s data on a timeline, or parsing and understanding trial results on large cohorts for analysis.

Following are the novel contributions of this paper:

• Introducing two new RE architectures.
• Evaluating and comparing performance of the proposed models on benchmark datasets.
• Training the models on custom datasets and demonstrating how RE can be used to get a structured output for specific use-cases.
• Studying the use-case of putting the history and medical history of patients on a timeline.
• Analyzing the benefits of using RE to get more precise entity chunks for achieving better performance while mapping them to medical codes.

Figure 2: Overview of the first RE model. All the features are vertically stacked in a single feature vector. The feature vector is kept dynamic with additional padding for compatibility across different embedding sizes, and complex dependency structures.
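The stacking-and-padding idea in the Figure 2 caption can be sketched as follows. This is only an illustration under stated assumptions: the paper does not give the exact feature layout, so the function names, the choice of scalar features, and the zero right-padding are all hypothetical.

```python
def pad(vector, size, fill=0.0):
    """Right-pad (or truncate) a vector to a fixed length."""
    return list(vector[:size]) + [fill] * max(0, size - len(vector))


def build_feature_vector(scalar_features, entity1_emb, entity2_emb, max_emb_dim):
    """Stack scalar features and two padded entity embeddings into one flat vector."""
    return (
        list(scalar_features)
        + pad(entity1_emb, max_emb_dim)
        + pad(entity2_emb, max_emb_dim)
    )


# Hypothetical inputs: [semantic similarity, syntactic distance] plus toy embeddings
# of two different sizes. Both calls yield vectors of the same length.
features_small = build_feature_vector([0.82, 3.0], [0.1, 0.2], [0.3, 0.4], max_emb_dim=4)
features_large = build_feature_vector(
    [0.82, 3.0], [0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], max_emb_dim=4
)
```

Padding to a fixed maximum dimension keeps the input width of the classifier constant, so the same network can in principle be fed embeddings of different sizes.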

2. Approach

We treat RE as a classification problem where each example is a pair of biomedical entities appearing in a given context - the entities being NER chunks, and context being the sentence / entire document - and develop two novel solutions: the first one comprising a simpler FCNN architecture for speed, and the second one based on the BioBERT [5] architecture for accuracy. We experiment with both approaches and compare their results.

For our first RE solution we rely on entity spans and types identified by the NER model to develop distinct features to feed to an FCNN for classification. At first we generate distinct pairs of entities (e.g. symptom-treatment), and then generate custom features for each pair. These features include semantic similarity of the entities, syntactic distance of the two entities, dependency structure of the entire document, embedding vectors of the entity spans, as well as embedding vectors for 100 tokens within the vicinity of each entity. Figure 2 explains our model architecture in detail. We then concatenate these features and feed them to fully connected layers with leaky ReLU activation. We also use batch normalisation after each affine transformation before feeding to the final softmax layer with cross-entropy loss function. We use softmax cross-entropy instead of binary cross-entropy loss to keep the architecture flexible for scaling on datasets having multiple relation types.

Our second solution focuses on a higher accuracy, as well as exploration of relations across long documents, and is based on [6]. In our implementation, we implement the model in Apache Spark for scalability, take checkpoints from the BioBERT model, and train an end-to-end BERT model for RE. Similar to the first solution, this architecture also depends on the entity spans identified by the base NER model, and uses the entire document as context string while training the model. The original paper used a sequence length of 128 tokens for the context string, which we keep constant, and instead experiment with the content of the context string, training data augmentation, and fine-tuning techniques.

We use Spark NLP’s [7] NER models [8] as the foundation for the RE models as these NER models provide entity spans required for performing RE. In a single inference pipeline, the RE models are placed sequentially after the NER model, and are fed the results of the NER model, the context, embeddings, and dependency tree for feature generation. Apart from feature generation, the dependency tree also helps regularize candidate entity pairs for RE classification as we can eliminate pairs having a larger syntactic distance. This modular approach of arranging components reduces coupling and achieves a higher degree of memory and computational efficiency as components like sentences, tokens, and embeddings are shared between NER and RE models and don’t need to be executed again. Since the NER model is essentially a token classifier and produces a prediction per token, we convert the tokens to chunks using BIO tags.

3. Experiments

We test the models on public datasets, report evaluation metrics, and analyse the results on examples. In addition to public datasets, we explain the process of annotating and training models on new datasets. We then study the utility of applying RE for some use-cases like knowledge graph generation and improved entity resolution (the process of mapping entity chunks to medical codes).
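As an illustration of the token-to-chunk step mentioned in Section 2, the sketch below groups per-token BIO tags into entity chunks. It is a generic BIO decoder written for this example, not the Spark NLP implementation. The function name and the entity labels (Test, BodyPart, Date) are hypothetical.

```python
def bio_to_chunks(tokens, tags):
    """Group per-token BIO tags into (entity_type, start, end, text) chunks.

    `end` is exclusive. An I- tag that does not continue a chunk of the
    same type is treated as the start of a new chunk.
    """
    chunks = []
    current = None  # [entity_type, start_index] of the chunk being built
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and current is not None and current[0] == label
        if not continues:
            if current is not None:
                # Close the chunk that ended just before token i.
                chunks.append((current[0], current[1], i, " ".join(tokens[current[1]:i])))
                current = None
            if prefix in ("B", "I"):
                current = [label, i]
    if current is not None:
        chunks.append((current[0], current[1], len(tokens), " ".join(tokens[current[1]:])))
    return chunks


tokens = ["CT", "scan", "of", "the", "left", "lung", "on", "Monday"]
tags = ["B-Test", "I-Test", "O", "O", "B-BodyPart", "I-BodyPart", "O", "B-Date"]
chunks = bio_to_chunks(tokens, tags)
```

The resulting chunks carry both the entity type and the token span, which are the two pieces of information the RE models need from the NER stage.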

Table 2: Speed comparison of the FCNN RE model and the BERT RE model on 1k and 10k notes.

3.1. Performance on Public Datasets

We test both model architectures on seven public datasets by using the official training-test split for training and testing the models, and report macro-average F1 scores for each one of them in Table 1. These datasets include the 2012 i2b2 challenge for evaluating temporal relations in clinical text [9], the 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text [10], the Drug-Drug-Interaction (DDI) dataset for linking drugs with dispositions and reactions [11], the Chemical–protein interaction (CPI) dataset for linking genes/proteins with drug chemicals [12], the Phenotype-Gene Relations (PGR) dataset for relating human phenotypes and genes [13], the adverse drug events dataset for relating drugs with their reactions [14], and the posology relations task based on the 2018 n2c2 task [15]. For the sake of brevity we don’t delve into the details of each dataset, which can be found in the cited resources. As shown in Table 1, the BERT model achieves new SOTA metrics on 5 public datasets, and outperforms the lighter FCNN model due to better contextual awareness. However, it is more than 3 times slower and has much higher memory requirements. Table 2 compares the speed difference of the two architectures. Hyperparameter settings and sample Python code for training an RE model from scratch are in Appendix A & C.

Table 1: Macro-average F1 scores of the FCNN and BioBERT models compared with the current SOTA.

Dataset          FCNN    BioBERT   Curr-SOTA
i2b2-Temporal    68.7    73.6      72.41
i2b2-Clinical    60.4    69.1      67.97
DDI              69.2    72.1      84.1
CPI              65.8    74.3      88.9
PGR              81.2    87.9      79.4
ADE Corpus       89.2    90.0      83.7
Posology         87.8    96.7      96.1

... test result) that can complement core entities (e.g., symptom, procedure, test) as the first entity, and disjoint entity types - meaning the entities should not have relations among themselves - from the core entities for the second entity, as explained in Table 3. Since the first entity can relate to multiple entities in the second column, we can define the relation between the two entity types as one-to-many, and can keep the relation types to a minimum, i.e. are the two entities related or not. This approach helps reduce annotation complexity, resulting in faster annotation times and a higher inter-annotator agreement. For annotation purposes we utilized the publicly available Annotation Lab tool.

Table 3: Entity types related by each RE model.

Model                         Entity 1     Entity 2
re_bodypart_procedure_test    Body Part    Procedure, Test
re_test_result_date           Test         Test Result, Date
re_bodypart_problem           Body Part    Symptom
re_test_problem_finding       Test         Symptom
re_bodypart_directions        Body Part    Direction
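The entity-type pairings of Table 3 can be used to restrict which chunk pairs are passed to each RE model. The sketch below assumes the pairings as read from Table 3. The function name and the (text, type) chunk layout are hypothetical, and the actual models may select candidates differently.

```python
# Allowed (entity 1 type, entity 2 type) combinations per model, as read from Table 3.
ALLOWED_TYPE_PAIRS = {
    "re_bodypart_procedure_test": {("Body Part", "Procedure"), ("Body Part", "Test")},
    "re_test_result_date": {("Test", "Test Result"), ("Test", "Date")},
    "re_bodypart_problem": {("Body Part", "Symptom")},
    "re_test_problem_finding": {("Test", "Symptom")},
    "re_bodypart_directions": {("Body Part", "Direction")},
}


def typed_candidate_pairs(chunks, model_name):
    """Keep only chunk pairs whose entity types are allowed for the given model.

    `chunks` is a list of (text, entity_type) tuples.
    """
    allowed = ALLOWED_TYPE_PAIRS[model_name]
    return [
        (first_text, second_text)
        for first_text, first_type in chunks
        for second_text, second_type in chunks
        if (first_type, second_type) in allowed
    ]


# Hypothetical chunks from a single sentence.
chunks = [
    ("left lung", "Body Part"),
    ("CT scan", "Test"),
    ("Monday", "Date"),
    ("cough", "Symptom"),
]
date_pairs = typed_candidate_pairs(chunks, "re_test_result_date")
```

Because each model only sees type-compatible pairs, the binary "related or not" decision stays simple and the number of pairs to classify stays small.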

4. Practical Applications of Relation Extraction

4.1. Generating Knowledge Graph with Relations

The most notable benefit of RE is the ability to generate knowledge graphs from unstructured text. For this experiment, we used pretrained Spark NLP NER models and the general-purpose RE models explained in the previous section to process medical reports, with the primary goal of generating a concise structured output of a report. For instance, we relate procedures with dates and findings to recognize the date of a procedure and its findings along with any existing condition. We use the relations between body parts and procedures to get more specific details of the location of the procedure. Similarly, relating body parts with findings like test results and measurements can add more details to the final output in specific use-cases. More granularity can be achieved by having further subdivisions of body parts. For instance, in our experiment, we divide the body part into three parts: the primary body part (e.g., lung), a sub-part (e.g., lobe), and the direction/laterality (e.g., left) of the body part. In practice, these specific entities trickle from the NER model down to the RE models. A graph generated from a sample report can be seen in Figure 5.
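A minimal sketch of turning extracted relations into a graph structure is shown below. The relation names and the sample triples are hypothetical, chosen only to mirror the procedure, date, finding and body-part relations described above. They are not output from the models in this paper.

```python
from collections import defaultdict


def build_graph(relations):
    """Build an adjacency map: entity -> list of (relation, related entity)."""
    graph = defaultdict(list)
    for head, relation, tail in relations:
        graph[head].append((relation, tail))
    return dict(graph)


# Hypothetical (head, relation, tail) triples for one report.
relations = [
    ("CT scan", "performed_on", "2021-03-04"),
    ("CT scan", "located_at", "left lung"),
    ("CT scan", "finding", "nodule"),
    ("left lung", "sub_part", "upper lobe"),
]
graph = build_graph(relations)
```

Looking up `graph["CT scan"]` then returns the date, location and finding attached to that procedure, which is the kind of concise structured output the section describes.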

Furthermore, the structured data can help create a patient timeline which can show progress of a certain

4.2. Enriching Chunks for more Accurate Coding

Entity Resolver models map entity chunks to medical codes like CPT [23], ICD [24], SNOMED [25], MeSH [26], RxNorm [27], etc., based on semantic similarity. This task is challenging for two major reasons. First, the inherent noise of the text, like abbreviations, acronyms, and synonyms, can result in false positives. Second, medical codes are sensitive to variables like severity, location in the human body, administration type, diagnosis method, etc. For a given condition or treatment, there could be different codes (within the same ontology) depending on the aforementioned factors. This challenge is more prominent in ontologies with wider vocabularies like SNOMED.

RE provides a solution to both problems. First, it intrinsically cleans the input for the resolver models of stop words and noise without additional effort. Second, it adds additional information to the core entity chunks from the surrounding context. With the help of relations, simple entities can be enriched with precise information to get accurate codes. For example, a chunk CT Scan - identified as a procedure - can be enriched with the imaging technique to achieve a more accurate CPT/SNOMED code. Enriching it further with the location of the procedure (e.g., chest) would result in an even more accurate chunk that can be resolved to a more specific CPT/SNOMED code. Table 4 compares base chunks with enriched chunks that include body parts, demonstrating the benefits of enriched entity chunks for improved coding.

Table 4: Base chunks compared with enriched chunks that include body parts.

Ontology     Base chunk
CPT          CT Scan
SNOMED CT    CT Scan
ICD-10       Lesion
SNOMED       Lesion

5. Conclusion

In this paper we presented two new model architectures for RE while enabling scalability. We then tested the models on public datasets and reported evaluation metrics. The model metrics show that the BioBERT based model outperforms the lighter FCNN model, and obtains new state-of-the-art accuracy on five benchmarks.

However, for datasets with a small number of relation types, the simpler FCNN model may be a compelling option not only due to faster run times, but also due to much lower memory requirements compared to the BioBERT model, allowing larger datasets to be processed on commodity hardware. We also explain how to train RE models from scratch and describe the design behind the pre-trained models available as part of the Spark NLP library.

We then study practical use cases where RE plays the salient role of linking entities together to generate knowledge graphs, patient timelines, and structured summaries of medical notes. Relating dates to primary procedures and problems can help create a timeline for each patient.

Finally, using granular NER models together with discrete RE models to clean and enrich entity chunks enables better entity resolution to clinical codes.

Given the complex nature of RE, and the pivotal role of contextual information, a common approach is to limit relations within a certain syntactic span, as even BERT models have a token sequence limit. A future research direction could be to focus on improving contextual representation of large documents to allow relations over lengthy contextual spans. A second future research direction is to test whether auxiliary data - either from medically annotated data or through transfer learning from healthcare-specific language models - can deliver higher accuracy Relation Extraction on the same neural network architectures.

References

[1] Yadav, Ramesh, Saha, Ekbal, Relation extraction from biomedical and clinical text: Unified multitask learning framework, CoRR abs/2009.09509 (2020). URL: https://arxiv.org/abs/2009.09509. arXiv:2009.09509.
[2] Wei, Ji, Si, Du, Wang, Tiryaki, Wu, Tao, Roberts, Xu, Relation extraction from clinical narratives using pre-trained language models, in: AMIA Annual Symposium Proceedings, volume 2019, American Medical Informatics Association, 2019, p. 1236.
[3] Devlin, Chang, Lee, Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[4] Wang, Lu, Two are better than one: Joint entity and relation extraction with table-sequence encoders, CoRR abs/2010.03851 (2020). URL: https://arxiv.org/abs/2010.03851. arXiv:2010.03851.
[5] Lee, Yoon, Kim, C. H. So, Kang, Biobert: a pre-trained biomedical language representation model for biomedical text mining, CoRR abs/1901.08746 (2019). URL: http://arxiv.org/abs/1901.08746. arXiv:1901.08746.
[6] L. B. Soares, N. FitzGerald, J. Ling, T. Kwiatkowski, Matching the blanks: Distributional similarity for relation learning, CoRR abs/1906.03158 (2019). URL: http://arxiv.org/abs/1906.03158. arXiv:1906.03158.
[7] Kocaman, Talby, Spark nlp: Natural language understanding at scale, Software Impacts 8 (2021) 100058. URL: https://www.sciencedirect.com/science/article/pii/S2665963821000063. doi:https://doi.org/10.1016/j.simpa.2021.100058.
[8] V. Kocaman, D. Talby, Improving clinical document understanding on COVID-19 research with spark NLP, CoRR abs/2012.04005 (2020). URL: https://arxiv.org/abs/2012.04005. arXiv:2012.04005.
[9] W. Sun, A. Rumshisky, O. Uzuner, Evaluating temporal relations in clinical text: 2012 i2b2 challenge, Journal of the American Medical Informatics Association : JAMIA 20 (2013). doi:10.1136/amiajnl-2013-001628.
[10] Ö. Uzuner, B. R. South, S. Shen, S. L. DuVall, 2010 i2b2/va challenge on concepts, assertions, and relations in clinical text, Journal of the American Medical Informatics Association 18 (2011) 552–556.
[11] M. Herrero-Zazo, I. Segura-Bedmar, P. Martínez, T. Declerck, The ddi corpus: An annotated corpus with pharmacological substances and drug–drug interactions, Journal of Biomedical Informatics 46 (2013) 914–920. URL: https://www.sciencedirect.com/science/article/pii/S1532046413001123. doi:https://doi.org/10.1016/j.jbi.2013.07.011.
[12] M. Krallinger, O. Rabal, S. A. Akhondi, M. P. Pérez, J. Santamaría, G. P. Rodríguez, G. Tsatsaronis, A. Intxaurrondo, J. A. B. López, U. K. Nandal, E. M. van Buel, A. Chandrasekhar, M. Rodenburg, A. Laegreid, M. A. Doornenbal, J. Oyarzábal, A. Lourenço, A. Valencia, Overview of the biocreative vi chemical-protein interaction track, 2017.
[13] D. Sousa, A. Lamurias, F. M. Couto, A silver standard corpus of human phenotype-gene relations, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 1487–1492. URL: https://aclanthology.org/N19-1152. doi:10.18653/v1/N19-1152.
[14] H. Gurulingappa, A. M. Rajput, A. Roberts, J. Fluck, M. Hofmann-Apitius, L. Toldo, Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports, Journal of Biomedical Informatics 45 (2012) 885–892. URL: https://www.sciencedirect.com/science/article/pii/S1532046412000615. doi:https://doi.org/10.1016/j.jbi.2012.04.008, text Mining and Natural Language Processing in Pharmacogenomics.
[15] S. Henry, K. Buchan, M. Filannino, A. Stubbs, O. Uzuner, 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records, Journal of the American Medical Informatics Association 27 (2020) 3–12.
[16] H. Guan, J. Li, H. Xu, M. V. Devarakonda, Robustly pre-trained neural model for direct temporal relation extraction, CoRR abs/2004.06216 (2020). URL: https://arxiv.org/abs/2004.06216. arXiv:2004.06216.
[17] D. Ningthoujam, S. Yadav, P. Bhattacharyya, A. Ekbal, Relation extraction between the clinical entities based on the shortest dependency path based LSTM, CoRR abs/1903.09941 (2019). URL: http://arxiv.org/abs/1903.09941. arXiv:1903.09941.
[18] M. Asada, M. Miwa, Y. Sasaki, Using drug descriptions and molecular structures for drug–drug interaction extraction from literature, Bioinformatics 37 (2020) 1739–1746. URL: https://doi.org/10.1093/bioinformatics/btaa907. doi:10.1093/bioinformatics/btaa907. arXiv:https://academic.oup.com/bioinformatics/article-pdf/37/12/1739/39119268/btaa907.pdf.
[19] L. N. Phan, J. T. Anibal, H. Tran, S. Chanana, E. Bahadroglu, A. Peltekian, G. Altan-Bonnet, Scifive: a text-to-text transformer model for biomedical literature, CoRR abs/2106.03598 (2021). URL: https://arxiv.org/abs/2106.03598. arXiv:2106.03598.
[20] D. Sousa, F. M. Couto, Biont: Deep learning using multiple biomedical ontologies for relation extraction, CoRR abs/2001.07139 (2020). URL: https://arxiv.org/abs/2001.07139. arXiv:2001.07139.
[21] P. Crone, Deeper task-specificity improves joint entity and relation extraction, CoRR abs/2002.06424 (2020). URL: https://arxiv.org/abs/2002.06424. arXiv:2002.06424.
[22] X. Yang, Z. Yu, Y. Guo, J. Bian, Y. Wu, Clinical relation extraction using transformer-based models, CoRR abs/2107.08957 (2021). URL: https://arxiv.org/abs/2107.08957. arXiv:2107.08957.
[23] AMA, Cpt, https://www.ama-assn.org/amaone/cpt-current-procedural-terminology, 2020. Accessed: 2021-12-22.
[24] WHO, Icd10, https://www.who.int/standards/classifications/classification-of-diseases, 2019. Accessed: 2021-12-22.
[25] NLM, Snomed ct, https://www.nlm.nih.gov/healthit/snomedct/index.html, 2019. Accessed: 2021-12-22.
[26] NLM, Mesh, https://www.nlm.nih.gov/mesh/meshhome.html, 2021. Accessed: 2021-12-22.
[27] NLM, Rxnorm, https://www.nlm.nih.gov/research/umls/rxnorm/index.html, 2021. Accessed: 2021-12-22.
[28] JSL, Training code for re, https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.3.Clinical_RE_SparkNLP_Paper_Reproduce.ipynb, 2021. Accessed: 2021-12-23.

A. Hyperparameter Settings

Since optimal hyperparameter values vary for each dataset, a range of values which performed best in all the datasets can be seen in Table 5.

B. Preparing training data for RE model in Spark NLP

Since RE is a classification task, the primary inputs are the context string (sentence), and a pair of entities. If there are multiple pairs in a single context string, we treat them as disjoint inputs as each input encapsulates the required inputs like entity chunk pairs and context - which are then used to create input features. We can create a CSV formatted file where each row is a training example for the model, and contains the aforementioned inputs. The exact schema of the training file can be found in the training notebook [28].

C. Training an RE Model in Spark NLP

Code for training an RE model is provided as a Google Colab notebook [28]. As the majority of the public datasets are protected and cannot be shared, they need to be obtained from their official websites and converted to the required format before training.
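As an illustration of the row-per-pair layout described in Appendix B, the sketch below writes one CSV row for each entity pair found in the same context string. The column names, helper names, and the example sentence are hypothetical. The actual schema is the one defined in the training notebook [28].

```python
import csv
import io

# Hypothetical column names; see the training notebook [28] for the real schema.
FIELDNAMES = ["sentence", "chunk1", "label1", "chunk2", "label2", "rel"]


def rows_for_sentence(sentence, labelled_pairs):
    """Yield one training row per entity pair sharing the same context string."""
    for (chunk1, label1), (chunk2, label2), rel in labelled_pairs:
        yield {
            "sentence": sentence,
            "chunk1": chunk1,
            "label1": label1,
            "chunk2": chunk2,
            "label2": label2,
            "rel": rel,
        }


def to_csv(rows):
    """Serialize training rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical example: two pairs from one sentence become two disjoint rows.
sentence = "CT scan of the left lung on Monday showed a nodule."
pairs = [
    (("CT scan", "Test"), ("Monday", "Date"), "1"),
    (("CT scan", "Test"), ("left lung", "BodyPart"), "1"),
]
csv_text = to_csv(rows_for_sentence(sentence, pairs))
```

Repeating the context string in every row keeps each training example self-contained, at the cost of some redundancy in the file.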