Improving Clinical Document Understanding on COVID-19 Research with Spark NLP

Veysel Kocaman, David Talby
John Snow Labs Inc.
16192 Coastal Highway, Lewes, DE, USA 19958
{veysel, david}@johnsnowlabs.com

Abstract

Following the global COVID-19 pandemic, the number of scientific papers studying the virus has grown massively, leading to increased interest in automated literature review. We present a clinical text mining system that improves on previous efforts in three ways. First, it can recognize over 100 different entity types including social determinants of health, anatomy, risk factors, and adverse events in addition to other commonly used clinical and biomedical entities. Second, the text processing pipeline includes assertion status detection, to distinguish between clinical facts that are present, absent, conditional, or about someone other than the patient. Third, the deep learning models used are more accurate than previously available, leveraging an integrated pipeline of state-of-the-art pre-trained named entity recognition models, and improving on the previous best performing benchmarks for assertion status detection. We illustrate extracting trends and insights - e.g. most frequent disorders and symptoms, and most common vital signs and EKG findings - from the COVID-19 Open Research Dataset (CORD-19). The system is built using the Spark NLP library, which natively supports scaling to distributed clusters, leveraging GPUs, configurable and reusable NLP pipelines, healthcare-specific embeddings, and the ability to train models to support new entity types or human languages with no code changes.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The COVID-19 pandemic brought a surge of academic research about the virus - resulting in 23,634 new publications between January and June of 2020 (da Silva, Tsigaris, and Erfanmanesh 2020) and accelerating to 8,800 additions per week from June to November on the COVID-19 Open Research Dataset (Wang et al. 2020). Such a high volume of publications makes it impossible for researchers to read each publication, resulting in increased interest in applying natural language processing (NLP) and text mining techniques to enable semi-automated literature review (Cheng, Cao, and Liao 2020).

In parallel, there is a growing need for automated text mining of electronic health records (EHRs) in order to find clinical indications that new research points to. EHRs are the primary source of information for clinicians tracking the care of their patients. Information fed into these systems may be found in structured fields for which values are entered electronically (e.g. laboratory test orders or results) (Liede et al. 2015), but most of the time the information in these records is unstructured, making it largely inaccessible for statistical analysis (Murdoch and Detsky 2013). These records include information such as the reason for administering drugs, previous disorders of the patient, or the outcome of past treatments, and they are the largest source of empirical data in biomedical research, allowing for major scientific findings in highly relevant disorders such as cancer and Alzheimer's disease (Perera et al. 2014).

A primary building block in such text mining systems is named entity recognition (NER), which is regarded as a critical precursor for question answering, topic modelling, information retrieval, and other downstream tasks (Yadav and Bethard 2019). In the medical domain, NER recognizes the first meaningful chunks in a clinical note, which are then fed down the processing pipeline as input to subsequent downstream tasks such as clinical assertion status detection (Uzuner et al. 2011), clinical entity resolution (Tzitzivacos 2007), and de-identification of sensitive data (Uzuner, Luo, and Szolovits 2007) (see Figure 1). However, segmentation of clinical and drug entities is considered a difficult task in biomedical NER systems because of the complex orthographic structures of named entities (Liu et al. 2015).
The next step following an NER model in a clinical NLP pipeline is to assign an assertion status to each named entity given its context. The status of an assertion explains how a named entity (e.g. clinical finding, procedure, lab result) pertains to the patient by assigning a label such as present ("patient is diabetic"), absent ("patient denies nausea"), conditional ("dyspnea while climbing stairs"), or associated with someone else ("family history of depression"). In the context of COVID-19, accurate assertion status detection is crucial, since most patients will be tested for and asked about the same set of symptoms and comorbidities - so limiting a text mining pipeline to recognizing medical terms without context is not useful in practice.

In this study, we introduce a set of pre-trained NER models that are all trained on biomedical and clinical datasets with a Bi-LSTM-CNN-Char deep learning architecture, and a Bi-LSTM based assertion detection module, built on top of the Spark NLP software library. We then illustrate how to extract knowledge and relevant information from unstructured electronic health records (EHR) and the COVID-19 Open Research Dataset (CORD-19) by combining these models in a pipeline. Using state-of-the-art deep learning architectures, Spark NLP's NER and assertion modules can also be extended to other spoken languages with zero code changes, and, by utilizing Apache Spark, both training and inference of full NLP pipelines can scale to make the most of distributed Spark clusters. For brevity, the implementation details and training metrics of these models are kept out of the scope of this study.

The specific novel contributions of this paper are:
• Introducing a medical text mining pipeline composed of state-of-the-art, healthcare-specific NER models
• Introducing a clinical assertion status detection model that establishes a new state-of-the-art level of accuracy on a widely used benchmark
• Describing how to apply these models in a unified, performant, and scalable pipeline on documents from the CORD-19 dataset

Figure 1: Named Entity Recognition is a fundamental building block of medical text mining pipelines, and feeds downstream tasks such as assertion status, entity linking, de-identification, and relation extraction.
The remainder of the paper is organized as follows: Section 2 introduces the Spark NLP library, summarizes the NER and assertion detection model frameworks it implements, and elaborates on the named entities in each pre-trained NER model. Section 4 explains how to build a prediction pipeline to extract named entities and assign assertion statuses from a set of documents on a cluster with Spark NLP. Section 5 discusses benchmarking speed and scalability, and Section 6 concludes the paper by summarizing key points and future directions.

2 Named Entity Recognition in Spark NLP

The deep neural network architecture for named entity recognition in Spark NLP is based on the BiLSTM-CNN-Char framework, a modified version of the architecture proposed by Chiu and Nichols (2016). It is a neural network architecture that automatically detects word- and character-level features using a hybrid bidirectional LSTM and CNN architecture, eliminating the need for most feature engineering steps. The detailed architecture of the framework in the original paper is illustrated in Figure 2, and sample predictions from a set of pre-trained clinical NER models on a text taken from the CORD-19 dataset are shown in Figure 3.

Figure 2: Overview of the original BiLSTM-CNN-Char architecture (Chiu and Nichols 2016).

In Spark NLP, this architecture is implemented using TensorFlow, and has been heavily optimized for accuracy, speed, scalability, and memory utilization. It is tightly integrated with Apache Spark to let the driver node run the entire training using all of its available cores, and there is a CUDA version of each TensorFlow component to enable training models on GPU when available. Spark NLP provides open-source APIs in Python, Java, Scala, and R, so that users do not need to be aware of the underlying implementation details (TensorFlow, Spark, etc.) in order to use it.

Figure 3: Sample predictions from pre-trained clinical NER models in Spark NLP for Healthcare.

The full list of the entities for each pre-trained medical NER model is available in Appendix D, the accuracy metrics are given in Table 1, and sample Python code for training an NER model from scratch is in Appendix C.

Table 1: Validation metrics of the selected NER models trained with clinical word embeddings in Spark NLP. These NER models are trained with the datasets mentioned in the original papers cited (Appendix D).

model  number of entities  micro F1  macro F1
ner anatomy  10  0.750  0.851
ner bionlp  15  0.638  0.748
ner cellular  4  0.792  0.813
ner clinical  3  0.872  0.873
ner deid sd  7  0.896  0.942
ner deid enriched  17  0.762  0.934
ner diseases  1  0.960  0.960
ner drugs  1  0.963  0.964
ner events  10  0.690  0.801
jsl ner wip  76  0.842  0.863
ner posology  6  0.881  0.922
ner risk factors  8  0.593  0.728
ner human phenotype go  2  0.904  0.922
ner human phenotype gene  2  0.871  0.876
ner chemprot  3  0.785  0.817
ner ade  2  0.824  0.852
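Each pre-trained model in Table 1 is loaded the same way, with only the model name changing. The following is a minimal sketch (not part of the original appendices) using the same clinical embeddings and annotator classes shown in Appendix B; "ner_posology" is used here purely as an example, and the remaining pipeline stages are assumed to be defined as in Appendix B.

from sparknlp.annotator import NerDLModel, WordEmbeddingsModel

# Clinical word embeddings shared by the NER models listed in Table 1.
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# Swap in any model name from Table 1 (e.g. "ner_clinical", "ner_bionlp").
posology_ner = NerDLModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")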
3 Assertion Status Detection in Spark NLP

The deep neural network architecture for assertion status detection in Spark NLP is based on a Bi-LSTM framework, and is a modified version of the architecture proposed by Fancellu, Lopez, and Webber (2016). Its goal is to classify the assertions made on given medical concepts as being present, absent, or possible in the patient, conditionally present in the patient under certain circumstances, hypothetically present in the patient at some future point, or mentioned in the patient report but associated with someone else (Uzuner et al. 2011).

In the proposed implementation, input units depend on the target tokens (a named entity) and the neighboring words, which are explicitly encoded as a sequence using word embeddings. Similar to Fancellu, Lopez, and Webber (2016), we observed that 95% of the scope tokens (neighboring words) fall in a window of 9 tokens to the left and 15 to the right of the target tokens in the same dataset. We therefore implemented the same window size and used a learning rate of 0.0012, dropout of 0.05, batch size of 64, and a maximum sentence length of 250. The model has been implemented within Spark NLP as an annotator called AssertionDLModel. After training for 20 epochs and measuring accuracy on the official test set, this implementation exceeds the latest state-of-the-art accuracy benchmarks, as summarized in Table 2.

Table 2: Assertion detection model test metrics. Our implementation exceeds the benchmarks of the latest best model (Uzuner et al. 2011) in 4 out of 6 assertion labels - and in overall accuracy.

Assertion Label  Spark NLP  Latest Best
Absent  0.944  0.937
Someone-else  0.904  0.869
Conditional  0.441  0.422
Hypothetical  0.862  0.890
Possible  0.680  0.630
Present  0.953  0.957
micro F1  0.939  0.934

Sample predictions from a clinical assertion detection model can be seen in Table 3 (a minimal code sketch for producing such predictions follows the table).

Table 3: Sample predictions from the pre-trained clinical assertion detection model in Spark NLP.

Sample text: Patient with severe fever and sore throat. He shows no stomach pain and is maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer.

chunk | entity | assertion
severe fever | PROBLEM | Present
sore throat | PROBLEM | Present
stomach pain | PROBLEM | Absent
an epidural | TREATMENT | Present
PCA | TREATMENT | Present
pain control | PROBLEM | Present
short of breath | PROBLEM | Conditional
CT | TEST | Present
lung tumor | PROBLEM | Present
Alzheimer | PROBLEM | Someone-else
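A minimal sketch of producing Table 3-style output, assuming the nlpPipeline defined in Appendix B and a running Spark session; LightPipeline is Spark NLP's wrapper for fast single-document inference on the driver, and the variable names here are illustrative.

from sparknlp.base import LightPipeline

# Fit the Appendix B pipeline once on an empty data frame, then wrap it.
empty_df = spark.createDataFrame([[""]]).toDF("text")
model = LightPipeline(nlpPipeline.fit(empty_df))

text = ("Patient with severe fever and sore throat. He shows no stomach pain "
        "and is maintained on an epidural and PCA for pain control.")

annotated = model.fullAnnotate(text)[0]

# The assertion annotator emits one label per recognized chunk, in order.
for chunk, assertion in zip(annotated["ner_chunk"], annotated["assertion"]):
    print(chunk.result, chunk.metadata.get("entity"), assertion.result)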
4 Analysing the CORD-19 Dataset with Pre-trained Models

Since assertion status labels are assigned to a medical concept that is given as input to the assertion detection model, the NER and assertion models must work together sequentially. In Spark NLP, we handle this interaction by feeding the output of the NER models to an NER converter that creates chunks from labeled entities, and then feeding these chunks to the assertion status detection model within the same pipeline. The flow diagram of such a pipeline can be seen in Figure 4. As the flow diagram shows, in Spark NLP each generated (output) column is pointed to the next module as an input, depending on its input column specifications. Sample Python code for such a prediction pipeline can be seen in Appendix B.

Figure 4: The flow diagram of a Spark NLP pipeline. When we fit() on the pipeline with a Spark data frame, its text column is fed into the DocumentAssembler() transformer and a new column "document" is created as an initial entry point to Spark NLP for any Spark data frame. Then, the "document" column is fed into the SentenceDetector() module, the text is split into an array of sentences, and a new column "sentences" is created. Then, the "sentences" column is fed into Tokenizer(), each sentence is tokenized, and a new column "token" is created. Then, tokens are normalized (basic text cleaning) and word embeddings are generated for each. Now the data is ready to be fed into the NER models and then to the assertion model.

This enables users to easily configure arbitrary pipelines - such as running 20 pre-trained NER models within one pipeline, as we do in this analysis of the CORD-19 dataset (a minimal sketch of stacking several pre-trained NER models this way follows the entity-type list below). NLP pipelines configured this way are easily reproducible, since they are serializable and directly expressed in code. They also simplify experimentation - for example, comparing multiple NER and assertion status models in the same run (while benefiting from the fact that data and embeddings are only loaded into memory once), or trying different text cleaning steps before the NER stage (such as stopword removal, lemmatization, or automated spell correction).

While the CORD-19 text mining pipeline scales to process an arbitrary number of articles, for purposes of concrete demonstration the following tables show results on a random sample of 100 articles. The number of recognized named entities for the selected entity classes can be seen in Table 4. The number of entities detected from each document (20 NER models, over 10 documents) can be seen in Table 5. The most frequent phrases from the selected entity types can be found in Table 6. The predictions from the assertion status detection model for Disease Syndrome Disorder entities are shown in Table 7.

One benefit of this system compared to previous work is the variety of medical entity types that can be recognized. As detailed in Appendix D, this NLP pipeline extracts over 100 entity types. While most clinical named entity recognition projects focus on symptoms, treatments, and drugs, and most biomedical projects focus on chemicals, proteins, and genes, this pipeline goes beyond these and can also extract:

• Entities related to social determinants of health such as age and gender, race and ethnicity, diet, social history, employment, relationship status, alcohol use, sexual activity and orientation
• Medical risk factors such as hypertension, smoking, cholesterol, hyperlipidemia, weight and BMI, kidney disease, pregnancy, and diabetes
• Specific vital signs and lab results such as pulse, temperature, O2 saturation, respiration, LDL and HDL
• Detailed biomedical entity types such as organ, tissue, gene, human phenotype, species, amino acid, protein, cell, cell component, biological function, chemical, substance, process
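The sketch below illustrates the stacking pattern mentioned above: two of the pre-trained models from Table 1 share the same document, token, and embedding columns, and each writes to its own output column, so adding further models is a matter of appending stages. It assumes the annotators and imports from Appendix B (documentAssembler, sentenceDetector, tokenizer, word_embeddings); the output column names are illustrative, and the exact pre-trained model names should be checked against the Spark NLP models hub.

# Two pre-trained NER models sharing the same upstream columns.
clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_clinical_tags")

bionlp_ner = NerDLModel.pretrained("ner_bionlp", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_bionlp_tags")

# One converter per model turns token-level tags into entity chunks.
clinical_chunks = NerConverter() \
    .setInputCols(["sentence", "token", "ner_clinical_tags"]) \
    .setOutputCol("clinical_chunk")

bionlp_chunks = NerConverter() \
    .setInputCols(["sentence", "token", "ner_bionlp_tags"]) \
    .setOutputCol("bionlp_chunk")

multi_ner_pipeline = Pipeline(stages=[
    documentAssembler, sentenceDetector, tokenizer, word_embeddings,
    clinical_ner, clinical_chunks,
    bionlp_ner, bionlp_chunks])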
Table 4: The number of entities for the selected entity classes per document from the CORD-19 dataset (10 documents sampled).

document id | counts for: drug ingredient, medical device, test result, treatment, organism, chemical, anatomy, problem, location, disease, species, protein, chem, virus, gene, drug, cell, dna, ade, test
1 | 157 189 280 64 134 150 1312 944 634 188 124 608 129 30 55 229 109 130 56 254
2 | 277 296 137 120 155 124 1024 475 620 62 39 243 51 122 76 4 95 188 55 130
3 | 210 252 54 105 33 129 406 388 377 66 52 304 99 39 31 2 94 104 90 26
4 | 94 196 77 76 71 77 479 490 565 31 26 293 71 70 51 70 139 136 97 67
5 | 12 0 14 51 4 3 240 127 145 73 67 89 23 12 23 1 94 42 44 11
6 | 6 7 9 7 8 14 222 90 56 183 54 36 0 2 1 18 15 61 11 44
7 | 29 15 69 54 2 25 384 680 271 29 38 451 76 16 239 99 114 61 53 33
8 | 27 16 25 29 18 420 318 246 443 47 18 165 24 43 5 15 116 134 78 25
9 | 44 17 42 14 0 2 456 93 138 41 169 71 23 7 1 0 14 13 19 18
10 | 1 0 15 1 0 1 42 23 11 22 16 9 0 0 1 0 0 2 0 9

Table 5: The total number of entities from the selected NER models per document from the CORD-19 dataset (10 documents sampled).

document id | counts for: human phenotype gene, human phenotype go, chemprot clinical, bacterial species, anatomy coarse, events clinical, medmentions, ade clinical, risk factors, jsl ner wip, chemicals, posology, anatomy, deid enriched, diseases, cellular, clinical, bionlp, drugs, deid
1 | 157 128 584 336 1313 487 387 124 62 1649 847 1904 144 182 77 129 81 435 56 254
2 | 277 259 713 368 948 184 167 39 51 1182 772 1429 101 75 71 125 150 139 55 130
3 | 210 200 511 226 510 130 61 52 90 697 633 924 165 33 75 84 148 178 90 26
4 | 94 93 318 214 656 41 22 26 42 771 525 957 104 18 31 110 98 196 97 67
5 | 12 11 83 7 219 136 88 67 9 425 211 654 21 103 12 49 41 78 44 11
6 | 6 5 31 25 140 215 128 54 0 442 135 590 4 80 75 66 15 35 11 44
7 | 29 37 178 33 637 31 20 38 20 771 404 967 95 15 25 59 59 372 53 33
8 | 27 22 175 441 436 138 46 18 25 579 311 728 69 28 27 102 114 91 78 25
9 | 44 43 106 3 319 66 77 169 19 455 445 478 26 53 24 35 17 43 19 18
10 | 1 4 17 1 35 33 23 16 0 64 49 98 2 19 1 1 1 8 0 9

Table 4 shows that this variety is useful in practice in the context of COVID-19 research. On just 10 randomly selected documents and 20 entity types, there are over 60 cases of more than a hundred instances of one entity type found within one paper. Only in fewer than 10% of the cells were there fewer than 10 entities recognized for a specific entity type in a specific document. This suggests that text mining approaches that ignore these entity types fail to take advantage of a lot of the clinical insight that the COVID-19 research papers include.

Table 7 shows how an accurate assertion status detection model can help in filtering this large amount of entities - in order to focus researchers and downstream algorithms on the most clinically relevant insights. In this small sample, 'systemic disease' is a present clinical condition; 'infectious diseases' and 'disorders of immunity' are hypothetical; while 'skin diseases' and 'parvovirus' are associated with someone else.

Consider a common use case of building an automated knowledge graph that links patient symptoms to the drugs they are taking, existing conditions, or past procedures. The difference between having assertion status detection results, and being able to filter only to symptoms and drugs that positively impact the patient, will have a substantial impact on the accuracy of the bottom-line results. Since more than a thousand entities are recognized in each research paper, and there are hundreds of thousands of published COVID-19 papers, doing this automatically, accurately, and at scale is required.
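As a rough illustration of how such counts and filters can be computed at scale (this is a sketch, not the exact code used for Tables 4-7): it assumes the nlpPipeline from Appendix B and a hypothetical input data frame cord19_df with doc_id and text columns; column and variable names are illustrative, and the assertion label casing may differ between the model output and the tables.

from pyspark.sql import functions as F

# The fitted Appendix B pipeline applied to a data frame of articles;
# cord19_df (columns: doc_id, text) is a hypothetical input data frame.
results = nlpPipeline.fit(cord19_df).transform(cord19_df)

# Table 4/5-style counts: recognized chunks per document and entity type.
entity_counts = (results
    .select("doc_id", F.explode("ner_chunk").alias("chunk"))
    .select("doc_id",
            F.col("chunk.result").alias("term"),
            F.col("chunk.metadata")["entity"].alias("entity"))
    .groupBy("doc_id", "entity")
    .count())

# Pair each chunk with its assertion label (the assertion annotator emits
# one label per chunk, in the same order), then keep only "present" facts.
chunks = results.select("doc_id", F.posexplode("ner_chunk").alias("pos", "chunk"))
labels = results.select("doc_id", F.posexplode("assertion").alias("pos", "assertion"))
present_facts = (chunks.join(labels, ["doc_id", "pos"])
    .select("doc_id",
            F.col("chunk.result").alias("term"),
            F.col("chunk.metadata")["entity"].alias("entity"),
            F.col("assertion.result").alias("assertion"))
    .filter(F.lower("assertion") == "present"))

# Table 6-style view: most frequent terms per entity type across the corpus.
top_terms = (present_facts
    .groupBy("entity", "term").count()
    .orderBy(F.desc("count")))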
Table 6: The most frequent 10 terms from the selected entity types predicted by parsing 100 articles from the CORD-19 dataset (Wang et al. 2020) with the NER model named jsl ner wip in Spark NLP. From the model predictions we can obtain valuable information about the most frequent disorders or symptoms mentioned in the papers, or the most common vital sign and EKG findings, without reading the papers. According to this table, the most common symptoms are cough and inflammation, while the most commonly mentioned drug ingredients are oseltamivir and antibiotics. We can also say that cardiogenic oscillations and ventricular fibrillation are the most common EKG observations, while fever and hypothermia are the most common vital signs.

Disease Syndrome Disorder | Communicable Disease | Symptom | Drug Ingredient | Procedure | Vital Sign Findings | EKG Findings
infectious diseases | HIV | cough | oseltamivir | resuscitation | fever | low VT
sepsis | H1N1 | inflammation | biological agents | cardiac surgery | hypothermia | cardiogenic oscillations
influenza | tuberculosis | critically ill | VLPs | tracheostomy | hypoxia | significant changes
septic shock | influenza | necrosis | antibiotics | CPR | respiratory failure | CO reduces oxygen transport
asthma | TB | bleeding | saline | vaccination | hypotension | ventricular fibrillation
pneumonia | hepatitis viruses | lesion | antiviral | bronchoscopy | hypercapnia | significant impedance increases
COPD | measles | cell swelling | quercetin | intubation | tachypnea | ventricular fibrillation
gastroenteritis | pandemic influenza | hemorrhage | NaCl | transfection | respiratory distress | pulseless electrical activity
viral infections | seasonal influenza | diarrhea | ribavirin | bronchoalveolar lavage | hypoxaemia | mild-moderate hypothermia
SARS | rabies | toxicity | Norwalk agent | autopsy | pyrexia | cardiogenic oscillations

Table 7: Sample assertion status labels for a set of entities detected by an NER model as Disease Syndrome Disorder in the CORD-19 dataset.

chunk | assertion
systemic disease | Present
skin diseases | Someone-else
vascular disorders | Possible
infectious diseases | Hypothetical
disorders of immunity | Hypothetical
infectious disease | Hypothetical
word malacia | Present
chapter-necrosis | Hypothetical
parvovirus | Someone-else

5 Benchmarking Speed and Scalability

The design of Spark NLP pipelines as described in Figure 4, where new columns are added to an existing (potentially distributed) data frame with each additional pipeline step, is optimized for parallel execution. It is designed for the case where different rows may reside on different machines, benefiting from the optimizations and design of Spark ML.

In order to evaluate how fast the pipeline works and how effectively it scales to make use of a compute cluster, we ran the same Spark NLP prediction pipelines in local mode and in cluster mode. In local mode, a single Dell server with 32 cores and 32 GB of memory was used. In cluster mode, 10 machines with 32 GB of memory and 16 cores each were used, in a Databricks cluster on AWS. The performance results are shared in Figure 5.

Figure 5: Comparing the Spark NLP document parsing pipeline in standalone and cluster mode. Tests show that tokenization is 20x faster while entity extraction is 3.5x faster in cluster mode when compared to standalone mode.

These benchmarks show that tokenization is 20x faster while entity extraction is 3.5x faster on the cluster, compared to the single machine run. This indicates that the speedup depends on the complexity of the task. For example, tokenization provides super-linear speedup (i.e. growing the number of machines by 10x improves speed by more than 10x), while NER delivers sub-linear speedup (because it is a more computationally complex task).
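As a rough sketch of what running "the same pipeline" in both modes looks like (the master URL, package coordinates, and version below are placeholders, not the benchmark configuration): only the Spark session setup differs, while the pipeline definition and fit()/transform() calls stay unchanged.

import sparknlp
from pyspark.sql import SparkSession

# Local mode: a single-machine Spark session with Spark NLP on the classpath.
spark_local = sparknlp.start()

# Cluster mode: attach to an existing cluster; master URL and package
# version are illustrative placeholders.
spark_cluster = (SparkSession.builder
    .appName("cord19-text-mining")
    .master("spark://<cluster-master>:7077")
    .config("spark.jars.packages",
            "com.johnsnowlabs.nlp:spark-nlp_2.12:<version>")
    .getOrCreate())

# The pipeline definition and .fit()/.transform() calls are identical in
# both modes; Spark distributes the data frame rows across the executors.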
6 Conclusion

In this study, we introduced a set of pretrained named entity recognition and assertion status detection models that are trained on biomedical and clinical datasets with deep learning architectures on top of Spark NLP. We then presented how to extract relevant facts from the CORD-19 dataset by applying state-of-the-art NER and assertion status models in a unified and scalable pipeline, and shared the results to illustrate extracting valuable information from scientific papers.

The results suggest that the papers in CORD-19 include a wide variety of the many entity types that this new NLP pipeline can recognize, and that assertion status detection is a useful filter on these entities. This bodes well for the richness of downstream analysis that can be done on this now structured and normalized data - such as clustering, dimensionality reduction, semantic similarity, visualization, or graph-based analysis to identify correlated concepts. One future research direction is to apply these downstream analyses on the richer, scalable, and more accurate insights that this NLP pipeline generates.

Since NER and assertion status models in Spark NLP are trainable, it is easy to add support for a new language like German, French, or Spanish, as long as there is annotated data for it. Spark NLP currently supports 46 languages, and 3 languages for Healthcare - English, German, and Spanish. Spark NLP provides production-grade libraries for popular programming languages - Python, Scala, Java, and R - and has an active community, frequent releases, public documentation, and freely available code examples. Future work in this space includes adding support for additional languages and additional entity types, and extending the NLP pipeline further by adding relation extraction and entity resolution models.

References

Cheng, X.; Cao, Q.; and Liao, S. 2020. An overview of literature on COVID-19, MERS and SARS: Using text mining and latent Dirichlet allocation. Journal of Information Science.

Chiu, J. P.; and Nichols, E. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics 4: 357–370.

da Silva, J. A. T.; Tsigaris, P.; and Erfanmanesh, M. 2020. Publishing volumes in major databases related to Covid-19. Scientometrics 1–12.

Doğan, R. I.; Leaman, R.; and Lu, Z. 2014. NCBI disease corpus: a resource for disease name recognition and concept normalization. Journal of Biomedical Informatics 47: 1–10.

Fancellu, F.; Lopez, A.; and Webber, B. 2016. Neural networks for negation scope detection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 495–504.

Henry, S.; Buchan, K.; Filannino, M.; Stubbs, A.; and Uzuner, O. 2020. 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. Journal of the American Medical Informatics Association 27(1): 3–12.

Johnson, A. E.; Pollard, T. J.; Shen, L.; Li-Wei, H. L.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Celi, L. A.; and Mark, R. G. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data 3(1): 1–9.

Kim, J.-D.; Ohta, T.; Tsuruoka, Y.; Tateisi, Y.; and Collier, N. 2004. Introduction to the bio-entity recognition task at JNLPBA. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, 70–75. Citeseer.

Liede, A.; Hernandez, R. K.; Roth, M.; Calkins, G.; Larrabee, K.; and Nicacio, L. 2015. Validation of International Classification of Diseases coding for bone metastases in electronic health records using technology-enabled abstraction. Clinical Epidemiology 7: 441.

Liu, S.; Tang, B.; Chen, Q.; and Wang, X. 2015. Effects of semantic features on machine learning-based drug name recognition systems: word embeddings vs. manually constructed dictionaries. Information 6(4): 848–865.

Murdoch, T. B.; and Detsky, A. S. 2013. The inevitable application of big data to health care. JAMA 309(13): 1351–1352.

Nédellec, C.; Bossy, R.; Kim, J.-D.; Kim, J.-J.; Ohta, T.; Pyysalo, S.; and Zweigenbaum, P. 2013. Overview of BioNLP shared task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop, 1–7.

Perera, G.; Khondoker, M.; Broadbent, M.; Breen, G.; and Stewart, R. 2014. Factors associated with response to acetylcholinesterase inhibition in dementia: a cohort study from a secondary mental health care case register in London. PloS One 9(11): e109484.

Pyysalo, S.; and Ananiadou, S. 2014. Anatomical entity mention recognition at literature scale. Bioinformatics 30(6): 868–875.
Segura Bedmar, I.; Martínez, P.; and Herrero Zazo, M. 2013. SemEval-2013 Task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). Association for Computational Linguistics.

Sousa, D.; Lamurias, A.; and Couto, F. M. 2019. A silver standard corpus of human phenotype-gene relations. arXiv preprint arXiv:1903.10728.

Stubbs, A.; Kotfila, C.; Xu, H.; and Uzuner, Ö. 2015. Identifying risk factors for heart disease over time: Overview of 2014 i2b2/UTHealth shared task Track 2. Journal of Biomedical Informatics 58: S67–S77.

Sun, W.; Rumshisky, A.; and Uzuner, O. 2013. Evaluating temporal relations in clinical text: 2012 i2b2 Challenge. Journal of the American Medical Informatics Association 20(5): 806–813.

Tzitzivacos, D. 2007. International Classification of Diseases 10th edition (ICD-10): main article. CME: Your SA Journal of CPD 25(1): 8–10.

Uzuner, Ö.; Luo, Y.; and Szolovits, P. 2007. Evaluating the state-of-the-art in automatic de-identification. Journal of the American Medical Informatics Association 14(5): 550–563.

Uzuner, Ö.; South, B. R.; Shen, S.; and DuVall, S. L. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association 18(5): 552–556.

Wang, L. L.; Lo, K.; Chandrasekhar, Y.; Reas, R.; Yang, J.; Eide, D.; Funk, K.; Kinney, R.; Liu, Z.; Merrill, W.; et al. 2020. CORD-19: The Covid-19 Open Research Dataset. ArXiv.

Yadav, V.; and Bethard, S. 2019. A survey on recent advances in named entity recognition from deep learning models. arXiv preprint arXiv:1910.11470.

Appendices

A NER Model Training Tagging Schema

BIO (Begin, Inside and Outside) and BIOES (Begin, Inside, Outside, End, Single) are schemes for encoding entity annotations as token tags. Words tagged with O are outside of named entities, and the I-XXX tag is used for words inside a named entity of type XXX. Whenever two entities of type XXX are immediately next to each other, the first word of the second entity is tagged B-XXX to highlight that it starts another entity. On the other hand, BIOES (also known as BILOU) is a slightly more sophisticated annotation method that distinguishes between the end of a named entity and single-token entities. BIOES stands for Begin, Inside, Outside, End, Single. In this scheme, for example, a word describing a gene entity is tagged with "B-Gene" if it is at the beginning of the entity, "I-Gene" if it is in the middle of the entity, and "E-Gene" if it is at the end of the entity. Single-word gene entities are tagged with "S-Gene". All other words not describing entities of interest are tagged as 'O'.
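As a small illustration of the two schemes (the sentence and the Gene label below are made up for this example), the same token sequence is tagged as follows:

tokens = ["tumor", "necrosis", "factor", "and", "TP53", "were", "measured"]

# BIO: the first token of an entity gets B-, the rest get I-, others get O.
bio   = ["B-Gene", "I-Gene", "I-Gene", "O", "B-Gene", "O", "O"]

# BIOES: the last token of a multi-token entity gets E-, and a
# single-token entity gets S- instead of B-.
bioes = ["B-Gene", "I-Gene", "E-Gene", "O", "S-Gene", "O", "O"]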
B Defining a Spark NLP Pipeline

from pyspark.ml import Pipeline
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Pre-trained clinical embeddings, NER, and assertion models.
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

clinical_assertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings"])\
    .setOutputCol("assertion")

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    clinical_assertion])

C Training an NER Model in Spark NLP

from pyspark.ml import Pipeline
import sparknlp
from sparknlp.training import CoNLL
from sparknlp.annotator import *

spark = sparknlp.start()

# CoNLL-formatted training data with document, sentence, token, and label columns.
training_data = CoNLL().readDataset(spark, 'BC5CDR_train.conll')

word_embedder = WordEmbeddings.pretrained('wikiner_6B_300', 'xx')\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

nerTagger = NerDLApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(10)\
    .setDropout(0.5)\
    .setLr(0.001)\
    .setPo(0.005)\
    .setBatchSize(8)\
    .setValidationSplit(0.2)

pipeline = Pipeline(stages=[
    word_embedder,
    nerTagger])

ner_model = pipeline.fit(training_data)
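A brief follow-up to Appendix C (not part of the original listing): the fitted pipeline is a standard Spark ML PipelineModel, so it can be persisted and reloaded with the usual Spark ML calls; the path below is illustrative.

from pyspark.ml import PipelineModel

# Persist the fitted NER pipeline and reload it later for inference.
ner_model.write().overwrite().save("ner_bc5cdr_pipeline")
restored = PipelineModel.load("ner_bc5cdr_pipeline")

# Applying the restored model adds the trained "ner" prediction column.
predictions = restored.transform(training_data)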
2015) cleic acid, nucleoside, or nucleotide, food, men- Entities: location, contact, date, profession, name, age, id tal process, prokaryote, nucleotide sequence, profes- sional or occupational group, cell, biologic function, ner deid enriched manufactured object, molecular biology research technique, (Stubbs et al. 2015) gene or genome, chemical, neoplastic process, phar- Entities: idnum, country, date, profession, medicalrecord, macologic substance, tissue, qualitative concept, username, organization, zip, id, healthplan, location, device, amino acid, peptide, or protein, fungus, popula- hospital, city, email, doctor, street, state, patient, bioid, url, tion group, body location or region, clinical attribute, phone, fax, age injury or poisoning, medical device, cell component, plant ner diseases ner posology (Doğan, Leaman, and Lu 2014) (Henry et al. 2020) Entities: disease Entities: form, dosage, strength, drug, route, frequency, duration ner drugs ner risk factors (Henry et al. 2020), (Segura Bedmar, Martı́nez, and Her- rero Zazo 2013) (Stubbs et al. 2015) Entities: drug Entities: family hist, smoker, obese, medication, hyper- tension, hyperlipidemia, phi, diabetes, cad ner events clinical ner human phenotype go clinical (Sun, Rumshisky, and Uzuner 2013) (Sousa, Lamurias, and Couto 2019) Entities: test, problem, clinical dept, occurrence, date, Entities: go, hp time, evidential, treatment, frequency, duration ner human phenotype gene clinical jsl ner wip clinical (Sousa, Lamurias, and Couto 2019) (in-house annotations from mtsamples and MIMIC-III (John- Entities: gene, hp son et al. 2016)) Entities: triglycerides, oncological, fe- ner chemprot clinical male reproductive status, form, time, date, alcohol, Entities: gene, chemical medical history header, race ethnicity, temperature, drug brandname, frequency, fetus newborn, sexu- ner ade clinical ally active or sexual orientation, disease syndrome disorder, section header, social history header, strength, cerebrovas- Entities: ade, drug cular disease, family history header, employment, weight, pregnancy, total cholesterol, diet, ekg findings, gender, ner chemicals drug ingredient, vaccine, substance, oxygen therapy, inter- Entities: chem nal organ or component, blood pressure, overweight, obe- sity, birth entity, heart disease, diabetes, substance quantity, ner bacterial species treatment, death entity, route, modifier, test, clinical dept, Entities: species