Improving Clinical Document Understanding on COVID-19 Research with Spark NLP

Veysel Kocaman, David Talby
John Snow Labs Inc.
16192 Coastal Highway, Lewes, DE, USA 19958
{veysel, david}@johnsnowlabs.com

Abstract

Following the global COVID-19 pandemic, the number of scientific papers studying the virus has grown massively, leading to increased interest in automated literature review. We present a clinical text mining system that improves on previous efforts in three ways. First, it can recognize over 100 different entity types including social determinants of health, anatomy, risk factors, and adverse events in addition to other commonly used clinical and biomedical entities. Second, the text processing pipeline includes assertion status detection, to distinguish between clinical facts that are present, absent, conditional, or about someone other than the patient. Third, the deep learning models used are more accurate than previously available, leveraging an integrated pipeline of state-of-the-art pre-trained named entity recognition models, and improving on the previous best performing benchmarks for assertion status detection. We illustrate extracting trends and insights - e.g. most frequent disorders and symptoms, and most common vital signs and EKG findings - from the COVID-19 Open Research Dataset (CORD-19). The system is built using the Spark NLP library, which natively supports scaling to distributed clusters, leveraging GPUs, configurable and reusable NLP pipelines, healthcare-specific embeddings, and the ability to train models to support new entity types or human languages with no code changes.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The COVID-19 pandemic brought a surge of academic research about the virus - resulting in 23,634 new publications between January and June of 2020 (da Silva, Tsigaris, and Erfanmanesh 2020) and accelerating to 8,800 additions per week from June to November on the COVID-19 Open Research Dataset (Wang et al. 2020). Such a high volume of publications makes it impossible for researchers to read each publication, resulting in increased interest in applying natural language processing (NLP) and text mining techniques to enable semi-automated literature review (Cheng, Cao, and Liao 2020).

In parallel, there is a growing need for automated text mining of electronic health records (EHRs) in order to find clinical indications that new research points to. EHRs are the primary source of information for clinicians tracking the care of their patients. Information fed into these systems may be found in structured fields for which values are entered electronically (e.g. laboratory test orders or results) (Liede et al. 2015), but most of the time the information in these records is unstructured, making it largely inaccessible for statistical analysis (Murdoch and Detsky 2013). These records include information such as the reason for administering drugs, previous disorders of the patient, or the outcome of past treatments, and they are the largest source of empirical data in biomedical research, allowing for major scientific findings in highly relevant disorders such as cancer and Alzheimer's disease (Perera et al. 2014).

A primary building block in such text mining systems is named entity recognition (NER), which is regarded as a critical precursor for question answering, topic modelling, information retrieval, and other downstream tasks (Yadav and Bethard 2019). In the medical domain, NER recognizes the first meaningful chunks in a clinical note, which are then fed down the processing pipeline as input to subsequent downstream tasks such as clinical assertion status detection (Uzuner et al. 2011), clinical entity resolution (Tzitzivacos 2007), and de-identification of sensitive data (Uzuner, Luo, and Szolovits 2007) (see Figure 1). However, segmentation of clinical and drug entities is considered a difficult task in biomedical NER systems because of the complex orthographic structures of named entities (Liu et al. 2015).
The next step following an NER model in a clinical NLP pipeline is to assign an assertion status to each named entity given its context. The status of an assertion explains how a named entity (e.g. clinical finding, procedure, lab result) pertains to the patient by assigning a label such as present ("patient is diabetic"), absent ("patient denies nausea"), conditional ("dyspnea while climbing stairs"), or associated with someone else ("family history of depression"). In the context of COVID-19, accurate assertion status detection is crucial, since most patients will be tested for and asked about the same set of symptoms and comorbidities - so limiting a text mining pipeline to recognizing medical terms without context is not useful in practice.

In this study, we introduce a set of pre-trained NER models that are all trained on biomedical and clinical datasets with a Bi-LSTM-CNN-Char deep learning architecture, and a Bi-LSTM based assertion detection module, built on top of the Spark NLP software library. We then illustrate how to extract knowledge and relevant information from unstructured electronic health records (EHR) and the COVID-19 Open Research Dataset (CORD-19) by combining these models in a pipeline. Using state-of-the-art deep learning architectures, Spark NLP's NER and assertion modules can also be extended to other spoken languages with zero code changes, and, by utilizing Apache Spark, both training and inference of full NLP pipelines can scale to make the most of distributed Spark clusters. For brevity, the implementation details and training metrics of these models are kept out of the scope of this study.

The specific novel contributions of this paper are:
• Introducing a medical text mining pipeline composed of state-of-the-art, healthcare-specific NER models
• Introducing a clinical assertion status detection model that establishes a new state-of-the-art level of accuracy on a widely used benchmark
• Describing how to apply these models in a unified, performant, and scalable pipeline on documents from the CORD-19 dataset

Figure 1: Named Entity Recognition is a fundamental building block of medical text mining pipelines, and feeds downstream tasks such as assertion status, entity linking, de-identification, and relation extraction.
The remainder of the paper is organized as follows: Section 2 introduces the Spark NLP library, summarizes the NER and assertion detection model frameworks it implements, and elaborates on the named entities in each pre-trained NER model. Section 4 explains how to build a prediction pipeline to extract named entities and assign assertion statuses from a set of documents on a cluster with Spark NLP. Section 5 discusses benchmarking speed and scalability, and Section 6 concludes the paper by summarizing key points and future directions.

2 Named Entity Recognition in Spark NLP

The deep neural network architecture for named entity recognition in Spark NLP is based on the BiLSTM-CNN-Char framework, a modified version of the architecture proposed by Chiu and Nichols (2016). It is a neural network architecture that automatically detects word- and character-level features using a hybrid bidirectional LSTM and CNN architecture, eliminating the need for most feature engineering steps. The detailed architecture of the framework in the original paper is illustrated in Figure 2, and sample predictions from a set of pre-trained clinical NER models on a text taken from the CORD-19 dataset are shown in Figure 3.

Figure 2: Overview of the original BiLSTM-CNN-Char architecture (Chiu and Nichols 2016).

In Spark NLP, this architecture is implemented using TensorFlow, and has been heavily optimized for accuracy, speed, scalability, and memory utilization. It is tightly integrated with Apache Spark to let the driver node run the entire training using all of its available cores, and there is a CUDA version of each TensorFlow component to enable training models on GPU when available. Spark NLP provides open-source APIs in Python, Java, Scala, and R, so that users do not need to be aware of the underlying implementation details (TensorFlow, Spark, etc.) in order to use it.

Figure 3: Sample predictions from pre-trained clinical NER models in Spark NLP for Healthcare.

The full list of the entities for each pre-trained medical NER model is available in Appendix D, the accuracy metrics are given in Table 1, and sample Python code for training an NER model from scratch is in Appendix C.

Table 1: Validation metrics of the selected NER models trained with clinical word embeddings in Spark NLP. These NER models are trained with the datasets mentioned in the original papers cited (Appendix D).

model  number of entities  micro F1  macro F1
ner anatomy  10  0.750  0.851
ner bionlp  15  0.638  0.748
ner cellular  4  0.792  0.813
ner clinical  3  0.872  0.873
ner deid sd  7  0.896  0.942
ner deid enriched  17  0.762  0.934
ner diseases  1  0.960  0.960
ner drugs  1  0.963  0.964
ner events  10  0.690  0.801
jsl ner wip  76  0.842  0.863
ner posology  6  0.881  0.922
ner risk factors  8  0.593  0.728
ner human phenotype go  2  0.904  0.922
ner human phenotype gene  2  0.871  0.876
ner chemprot  3  0.785  0.817
ner ade  2  0.824  0.852
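Each pre-trained model in Table 1 is loaded the same way, with only the model name changing. The following is a minimal sketch (not part of the original appendices) using the same clinical embeddings and annotator classes shown in Appendix B; "ner_posology" is used here purely as an example, and the remaining pipeline stages are assumed to be defined as in Appendix B.

from sparknlp.annotator import NerDLModel, WordEmbeddingsModel

# Clinical word embeddings shared by the NER models listed in Table 1.
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# Swap in any model name from Table 1 (e.g. "ner_clinical", "ner_bionlp").
posology_ner = NerDLModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")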
3 Assertion Status Detection in Spark NLP

The deep neural network architecture for assertion status detection in Spark NLP is based on a Bi-LSTM framework, and is a modified version of the architecture proposed by Fancellu, Lopez, and Webber (2016). Its goal is to classify the assertions made on given medical concepts as being present, absent, or possible in the patient, conditionally present in the patient under certain circumstances, hypothetically present in the patient at some future point, or mentioned in the patient report but associated with someone else (Uzuner et al. 2011).

In the proposed implementation, input units depend on the target tokens (a named entity) and the neighboring words, which are explicitly encoded as a sequence using word embeddings. Similar to Fancellu, Lopez, and Webber (2016), we observed that 95% of the scope tokens (neighboring words) fall in a window of 9 tokens to the left and 15 to the right of the target tokens in the same dataset. We therefore implemented the same window size and used a learning rate of 0.0012, dropout of 0.05, batch size of 64, and a maximum sentence length of 250. The model has been implemented within Spark NLP as an annotator called AssertionDLModel. After training for 20 epochs and measuring accuracy on the official test set, this implementation exceeds the latest state-of-the-art accuracy benchmarks, as summarized in Table 2.

Table 2: Assertion detection model test metrics. Our implementation exceeds the benchmarks of the latest best model (Uzuner et al. 2011) in 4 out of 6 assertion labels - and in overall accuracy.

Assertion Label  Spark NLP  Latest Best
Absent  0.944  0.937
Someone-else  0.904  0.869
Conditional  0.441  0.422
Hypothetical  0.862  0.890
Possible  0.680  0.630
Present  0.953  0.957
micro F1  0.939  0.934

Sample predictions from a clinical assertion detection model can be seen in Table 3 (a minimal code sketch for producing such predictions follows the table).

Table 3: Sample predictions from the pre-trained clinical assertion detection model in Spark NLP.

Sample text: Patient with severe fever and sore throat. He shows no stomach pain and is maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer.

chunk | entity | assertion
severe fever | PROBLEM | Present
sore throat | PROBLEM | Present
stomach pain | PROBLEM | Absent
an epidural | TREATMENT | Present
PCA | TREATMENT | Present
pain control | PROBLEM | Present
short of breath | PROBLEM | Conditional
CT | TEST | Present
lung tumor | PROBLEM | Present
Alzheimer | PROBLEM | Someone-else
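A minimal sketch of producing Table 3-style output, assuming the nlpPipeline defined in Appendix B and a running Spark session; LightPipeline is Spark NLP's wrapper for fast single-document inference on the driver, and the variable names here are illustrative.

from sparknlp.base import LightPipeline

# Fit the Appendix B pipeline once on an empty data frame, then wrap it.
empty_df = spark.createDataFrame([[""]]).toDF("text")
model = LightPipeline(nlpPipeline.fit(empty_df))

text = ("Patient with severe fever and sore throat. He shows no stomach pain "
        "and is maintained on an epidural and PCA for pain control.")

annotated = model.fullAnnotate(text)[0]

# The assertion annotator emits one label per recognized chunk, in order.
for chunk, assertion in zip(annotated["ner_chunk"], annotated["assertion"]):
    print(chunk.result, chunk.metadata.get("entity"), assertion.result)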
4 Analysing the CORD-19 Dataset with Pre-trained Models

Since assertion status labels are assigned to a medical concept that is given as input to the assertion detection model, the NER and assertion models must work together sequentially. In Spark NLP, we handle this interaction by feeding the output of the NER models to an NER converter that creates chunks from labeled entities, and then feeding these chunks to the assertion status detection model within the same pipeline. The flow diagram of such a pipeline can be seen in Figure 4. As the flow diagram shows, in Spark NLP each generated (output) column is pointed to the next module as an input, depending on its input column specifications. Sample Python code for such a prediction pipeline can be seen in Appendix B.

Figure 4: The flow diagram of a Spark NLP pipeline. When we fit() on the pipeline with a Spark data frame, its text column is fed into the DocumentAssembler() transformer and a new column "document" is created as an initial entry point to Spark NLP for any Spark data frame. Then, the "document" column is fed into the SentenceDetector() module, the text is split into an array of sentences, and a new column "sentences" is created. Then, the "sentences" column is fed into Tokenizer(), each sentence is tokenized, and a new column "token" is created. Then, tokens are normalized (basic text cleaning) and word embeddings are generated for each. Now the data is ready to be fed into the NER models and then to the assertion model.

This enables users to easily configure arbitrary pipelines - such as running 20 pre-trained NER models within one pipeline, as we do in this analysis of the CORD-19 dataset (a minimal sketch of stacking several pre-trained NER models this way follows the entity-type list below). NLP pipelines configured this way are easily reproducible, since they are serializable and directly expressed in code. They also simplify experimentation - for example, comparing multiple NER and assertion status models in the same run (while benefiting from the fact that data and embeddings are only loaded into memory once), or trying different text cleaning steps before the NER stage (such as stopword removal, lemmatization, or automated spell correction).

While the CORD-19 text mining pipeline scales to process an arbitrary number of articles, for purposes of concrete demonstration the following tables show results on a random sample of 100 articles. The number of recognized named entities for the selected entity classes can be seen in Table 4. The number of entities detected from each document (20 NER models, over 10 documents) can be seen in Table 5. The most frequent phrases from the selected entity types can be found in Table 6. The predictions from the assertion status detection model for Disease Syndrome Disorder entities are shown in Table 7.

One benefit of this system compared to previous work is the variety of medical entity types that can be recognized. As detailed in Appendix D, this NLP pipeline extracts over 100 entity types. While most clinical named entity recognition projects focus on symptoms, treatments, and drugs, and most biomedical projects focus on chemicals, proteins, and genes, this pipeline goes beyond these and can also extract:

• Entities related to social determinants of health such as age and gender, race and ethnicity, diet, social history, employment, relationship status, alcohol use, sexual activity and orientation
• Medical risk factors such as hypertension, smoking, cholesterol, hyperlipidemia, weight and BMI, kidney disease, pregnancy, and diabetes
• Specific vital signs and lab results such as pulse, temperature, O2 saturation, respiration, LDL and HDL
• Detailed biomedical entity types such as organ, tissue, gene, human phenotype, species, amino acid, protein, cell, cell component, biological function, chemical, substance, process
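The sketch below illustrates the stacking pattern mentioned above: two of the pre-trained models from Table 1 share the same document, token, and embedding columns, and each writes to its own output column, so adding further models is a matter of appending stages. It assumes the annotators and imports from Appendix B (documentAssembler, sentenceDetector, tokenizer, word_embeddings); the output column names are illustrative, and the exact pre-trained model names should be checked against the Spark NLP models hub.

# Two pre-trained NER models sharing the same upstream columns.
clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_clinical_tags")

bionlp_ner = NerDLModel.pretrained("ner_bionlp", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_bionlp_tags")

# One converter per model turns token-level tags into entity chunks.
clinical_chunks = NerConverter() \
    .setInputCols(["sentence", "token", "ner_clinical_tags"]) \
    .setOutputCol("clinical_chunk")

bionlp_chunks = NerConverter() \
    .setInputCols(["sentence", "token", "ner_bionlp_tags"]) \
    .setOutputCol("bionlp_chunk")

multi_ner_pipeline = Pipeline(stages=[
    documentAssembler, sentenceDetector, tokenizer, word_embeddings,
    clinical_ner, clinical_chunks,
    bionlp_ner, bionlp_chunks])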
Table 4: The number of entities for the selected entity classes per document from the CORD-19 dataset (10 documents sampled).

document id | counts for: drug ingredient, medical device, test result, treatment, organism, chemical, anatomy, problem, location, disease, species, protein, chem, virus, gene, drug, cell, dna, ade, test
1 | 157 189 280 64 134 150 1312 944 634 188 124 608 129 30 55 229 109 130 56 254
2 | 277 296 137 120 155 124 1024 475 620 62 39 243 51 122 76 4 95 188 55 130
3 | 210 252 54 105 33 129 406 388 377 66 52 304 99 39 31 2 94 104 90 26
4 | 94 196 77 76 71 77 479 490 565 31 26 293 71 70 51 70 139 136 97 67
5 | 12 0 14 51 4 3 240 127 145 73 67 89 23 12 23 1 94 42 44 11
6 | 6 7 9 7 8 14 222 90 56 183 54 36 0 2 1 18 15 61 11 44
7 | 29 15 69 54 2 25 384 680 271 29 38 451 76 16 239 99 114 61 53 33
8 | 27 16 25 29 18 420 318 246 443 47 18 165 24 43 5 15 116 134 78 25
9 | 44 17 42 14 0 2 456 93 138 41 169 71 23 7 1 0 14 13 19 18
10 | 1 0 15 1 0 1 42 23 11 22 16 9 0 0 1 0 0 2 0 9

Table 5: The total number of entities from the selected NER models per document from the CORD-19 dataset (10 documents sampled).

document id | counts for: human phenotype gene, human phenotype go, chemprot clinical, bacterial species, anatomy coarse, events clinical, medmentions, ade clinical, risk factors, jsl ner wip, chemicals, posology, anatomy, deid enriched, diseases, cellular, clinical, bionlp, drugs, deid
1 | 157 128 584 336 1313 487 387 124 62 1649 847 1904 144 182 77 129 81 435 56 254
2 | 277 259 713 368 948 184 167 39 51 1182 772 1429 101 75 71 125 150 139 55 130
3 | 210 200 511 226 510 130 61 52 90 697 633 924 165 33 75 84 148 178 90 26
4 | 94 93 318 214 656 41 22 26 42 771 525 957 104 18 31 110 98 196 97 67
5 | 12 11 83 7 219 136 88 67 9 425 211 654 21 103 12 49 41 78 44 11
6 | 6 5 31 25 140 215 128 54 0 442 135 590 4 80 75 66 15 35 11 44
7 | 29 37 178 33 637 31 20 38 20 771 404 967 95 15 25 59 59 372 53 33
8 | 27 22 175 441 436 138 46 18 25 579 311 728 69 28 27 102 114 91 78 25
9 | 44 43 106 3 319 66 77 169 19 455 445 478 26 53 24 35 17 43 19 18
10 | 1 4 17 1 35 33 23 16 0 64 49 98 2 19 1 1 1 8 0 9

Table 4 shows that this variety is useful in practice in the context of COVID-19 research. On just 10 randomly selected documents and 20 entity types, there are over 60 cases of more than a hundred instances of one entity type found within one paper. Only in fewer than 10% of the cells were there fewer than 10 entities recognized for a specific entity type in a specific document. This suggests that text mining approaches that ignore these entity types fail to take advantage of a lot of the clinical insight that the COVID-19 research papers include.

Table 7 shows how an accurate assertion status detection model can help in filtering this large amount of entities - in order to focus researchers and downstream algorithms on the most clinically relevant insights. In this small sample, 'systemic disease' is a present clinical condition; 'infectious diseases' and 'disorders of immunity' are hypothetical; while 'skin diseases' and 'parvovirus' are associated with someone else.

Consider a common use case of building an automated knowledge graph that links patient symptoms to the drugs they are taking, existing conditions, or past procedures. The difference between having assertion status detection results, and being able to filter only to symptoms and drugs that positively impact the patient, will have a substantial impact on the accuracy of the bottom-line results. Since more than a thousand entities are recognized in each research paper, and there are hundreds of thousands of published COVID-19 papers, doing this automatically, accurately, and at scale is required.
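As a rough illustration of how such counts and filters can be computed at scale (this is a sketch, not the exact code used for Tables 4-7): it assumes the nlpPipeline from Appendix B and a hypothetical input data frame cord19_df with doc_id and text columns; column and variable names are illustrative, and the assertion label casing may differ between the model output and the tables.

from pyspark.sql import functions as F

# The fitted Appendix B pipeline applied to a data frame of articles;
# cord19_df (columns: doc_id, text) is a hypothetical input data frame.
results = nlpPipeline.fit(cord19_df).transform(cord19_df)

# Table 4/5-style counts: recognized chunks per document and entity type.
entity_counts = (results
    .select("doc_id", F.explode("ner_chunk").alias("chunk"))
    .select("doc_id",
            F.col("chunk.result").alias("term"),
            F.col("chunk.metadata")["entity"].alias("entity"))
    .groupBy("doc_id", "entity")
    .count())

# Pair each chunk with its assertion label (the assertion annotator emits
# one label per chunk, in the same order), then keep only "present" facts.
chunks = results.select("doc_id", F.posexplode("ner_chunk").alias("pos", "chunk"))
labels = results.select("doc_id", F.posexplode("assertion").alias("pos", "assertion"))
present_facts = (chunks.join(labels, ["doc_id", "pos"])
    .select("doc_id",
            F.col("chunk.result").alias("term"),
            F.col("chunk.metadata")["entity"].alias("entity"),
            F.col("assertion.result").alias("assertion"))
    .filter(F.lower("assertion") == "present"))

# Table 6-style view: most frequent terms per entity type across the corpus.
top_terms = (present_facts
    .groupBy("entity", "term").count()
    .orderBy(F.desc("count")))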
Table 6: The most frequent 10 terms from the selected entity types predicted by parsing 100 articles from the CORD-19 dataset (Wang et al. 2020) with the NER model named jsl ner wip in Spark NLP. From the model predictions we can obtain valuable information about the most frequent disorders or symptoms mentioned in the papers, or the most common vital sign and EKG findings, without reading the papers. According to this table, the most common symptoms are cough and inflammation, while the most commonly mentioned drug ingredients are oseltamivir and antibiotics. We can also say that cardiogenic oscillations and ventricular fibrillation are the most common EKG observations, while fever and hypothermia are the most common vital signs.

Disease Syndrome Disorder | Communicable Disease | Symptom | Drug Ingredient | Procedure | Vital Sign Findings | EKG Findings
infectious diseases | HIV | cough | oseltamivir | resuscitation | fever | low VT
sepsis | H1N1 | inflammation | biological agents | cardiac surgery | hypothermia | cardiogenic oscillations
influenza | tuberculosis | critically ill | VLPs | tracheostomy | hypoxia | significant changes
septic shock | influenza | necrosis | antibiotics | CPR | respiratory failure | CO reduces oxygen transport
asthma | TB | bleeding | saline | vaccination | hypotension | ventricular fibrillation
pneumonia | hepatitis viruses | lesion | antiviral | bronchoscopy | hypercapnia | significant impedance increases
COPD | measles | cell swelling | quercetin | intubation | tachypnea | ventricular fibrillation
gastroenteritis | pandemic influenza | hemorrhage | NaCl | transfection | respiratory distress | pulseless electrical activity
viral infections | seasonal influenza | diarrhea | ribavirin | bronchoalveolar lavage | hypoxaemia | mild-moderate hypothermia
SARS | rabies | toxicity | Norwalk agent | autopsy | pyrexia | cardiogenic oscillations

Table 7: Sample assertion status labels for a set of entities detected by an NER model as Disease Syndrome Disorder in the CORD-19 dataset.

chunk | assertion
systemic disease | Present
skin diseases | Someone-else
vascular disorders | Possible
infectious diseases | Hypothetical
disorders of immunity | Hypothetical
infectious disease | Hypothetical
word malacia | Present
chapter-necrosis | Hypothetical
parvovirus | Someone-else

5 Benchmarking Speed and Scalability

The design of Spark NLP pipelines as described in Figure 4, where new columns are added to an existing (potentially distributed) data frame with each additional pipeline step, is optimized for parallel execution. It is designed for the case where different rows may reside on different machines, benefiting from the optimizations and design of Spark ML.

In order to evaluate how fast the pipeline works and how effectively it scales to make use of a compute cluster, we ran the same Spark NLP prediction pipelines in local mode and in cluster mode. In local mode, a single Dell server with 32 cores and 32 GB of memory was used. In cluster mode, 10 machines with 32 GB of memory and 16 cores each were used, in a Databricks cluster on AWS. The performance results are shared in Figure 5.

Figure 5: Comparing the Spark NLP document parsing pipeline in standalone and cluster mode. Tests show that tokenization is 20x faster while entity extraction is 3.5x faster in cluster mode when compared to standalone mode.

These benchmarks show that tokenization is 20x faster while entity extraction is 3.5x faster on the cluster, compared to the single machine run. This indicates that the speedup depends on the complexity of the task. For example, tokenization provides super-linear speedup (i.e. growing the number of machines by 10x improves speed by more than 10x), while NER delivers sub-linear speedup (because it is a more computationally complex task).
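As a rough sketch of what running "the same pipeline" in both modes looks like (the master URL, package coordinates, and version below are placeholders, not the benchmark configuration): only the Spark session setup differs, while the pipeline definition and fit()/transform() calls stay unchanged.

import sparknlp
from pyspark.sql import SparkSession

# Local mode: a single-machine Spark session with Spark NLP on the classpath.
spark_local = sparknlp.start()

# Cluster mode: attach to an existing cluster; master URL and package
# version are illustrative placeholders.
spark_cluster = (SparkSession.builder
    .appName("cord19-text-mining")
    .master("spark://<cluster-master>:7077")
    .config("spark.jars.packages",
            "com.johnsnowlabs.nlp:spark-nlp_2.12:<version>")
    .getOrCreate())

# The pipeline definition and .fit()/.transform() calls are identical in
# both modes; Spark distributes the data frame rows across the executors.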
6 Conclusion

In this study, we introduced a set of pretrained named entity recognition and assertion status detection models that are trained on biomedical and clinical datasets with deep learning architectures on top of Spark NLP. We then presented how to extract relevant facts from the CORD-19 dataset by applying state-of-the-art NER and assertion status models in a unified and scalable pipeline, and shared the results to illustrate extracting valuable information from scientific papers.

The results suggest that the papers in CORD-19 include a wide variety of the many entity types that this new NLP pipeline can recognize, and that assertion status detection is a useful filter on these entities. This bodes well for the richness of downstream analysis that can be done on this now structured and normalized data - such as clustering, dimensionality reduction, semantic similarity, visualization, or graph-based analysis to identify correlated concepts. One future research direction is to apply these downstream analyses on the richer, scalable, and more accurate insights that this NLP pipeline generates.

Since NER and assertion status models in Spark NLP are trainable, it is easy to add support for a new language like German, French, or Spanish, as long as there is annotated data for it. Spark NLP currently supports 46 languages, and 3 languages for Healthcare - English, German, and Spanish. Spark NLP provides production-grade libraries for popular programming languages - Python, Scala, Java, and R - and has an active community, frequent releases, public documentation, and freely available code examples. Future work in this space includes adding support for additional languages and additional entity types, and extending the NLP pipeline further by adding relation extraction and entity resolution models.

References

Cheng, X.; Cao, Q.; and Liao, S. 2020. An overview of literature on COVID-19, MERS and SARS: Using text mining and latent Dirichlet allocation. Journal of Information Science.

Chiu, J. P.; and Nichols, E. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics 4: 357–370.

da Silva, J. A. T.; Tsigaris, P.; and Erfanmanesh, M. 2020. Publishing volumes in major databases related to Covid-19. Scientometrics 1–12.

Doğan, R. I.; Leaman, R.; and Lu, Z. 2014. NCBI disease corpus: a resource for disease name recognition and concept normalization. Journal of Biomedical Informatics 47: 1–10.

Fancellu, F.; Lopez, A.; and Webber, B. 2016. Neural networks for negation scope detection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 495–504.

Henry, S.; Buchan, K.; Filannino, M.; Stubbs, A.; and Uzuner, O. 2020. 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. Journal of the American Medical Informatics Association 27(1): 3–12.

Johnson, A. E.; Pollard, T. J.; Shen, L.; Li-Wei, H. L.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Celi, L. A.; and Mark, R. G. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data 3(1): 1–9.

Kim, J.-D.; Ohta, T.; Tsuruoka, Y.; Tateisi, Y.; and Collier, N. 2004. Introduction to the bio-entity recognition task at JNLPBA. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, 70–75. Citeseer.

Liede, A.; Hernandez, R. K.; Roth, M.; Calkins, G.; Larrabee, K.; and Nicacio, L. 2015. Validation of International Classification of Diseases coding for bone metastases in electronic health records using technology-enabled abstraction. Clinical Epidemiology 7: 441.

Liu, S.; Tang, B.; Chen, Q.; and Wang, X. 2015. Effects of semantic features on machine learning-based drug name recognition systems: word embeddings vs. manually constructed dictionaries. Information 6(4): 848–865.

Murdoch, T. B.; and Detsky, A. S. 2013. The inevitable application of big data to health care. JAMA 309(13): 1351–1352.

Nédellec, C.; Bossy, R.; Kim, J.-D.; Kim, J.-J.; Ohta, T.; Pyysalo, S.; and Zweigenbaum, P. 2013. Overview of BioNLP shared task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop, 1–7.

Perera, G.; Khondoker, M.; Broadbent, M.; Breen, G.; and Stewart, R. 2014. Factors associated with response to acetylcholinesterase inhibition in dementia: a cohort study from a secondary mental health care case register in London. PloS One 9(11): e109484.

Pyysalo, S.; and Ananiadou, S. 2014. Anatomical entity mention recognition at literature scale. Bioinformatics 30(6): 868–875.
Segura Bedmar, I.; Martínez, P.; and Herrero Zazo, M. 2013. SemEval-2013 Task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). Association for Computational Linguistics.

Sousa, D.; Lamurias, A.; and Couto, F. M. 2019. A silver standard corpus of human phenotype-gene relations. arXiv preprint arXiv:1903.10728.

Stubbs, A.; Kotfila, C.; Xu, H.; and Uzuner, Ö. 2015. Identifying risk factors for heart disease over time: Overview of 2014 i2b2/UTHealth shared task Track 2. Journal of Biomedical Informatics 58: S67–S77.

Sun, W.; Rumshisky, A.; and Uzuner, O. 2013. Evaluating temporal relations in clinical text: 2012 i2b2 Challenge. Journal of the American Medical Informatics Association 20(5): 806–813.

Tzitzivacos, D. 2007. International Classification of Diseases 10th edition (ICD-10): main article. CME: Your SA Journal of CPD 25(1): 8–10.

Uzuner, Ö.; Luo, Y.; and Szolovits, P. 2007. Evaluating the state-of-the-art in automatic de-identification. Journal of the American Medical Informatics Association 14(5): 550–563.

Uzuner, Ö.; South, B. R.; Shen, S.; and DuVall, S. L. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association 18(5): 552–556.

Wang, L. L.; Lo, K.; Chandrasekhar, Y.; Reas, R.; Yang, J.; Eide, D.; Funk, K.; Kinney, R.; Liu, Z.; Merrill, W.; et al. 2020. CORD-19: The Covid-19 Open Research Dataset. ArXiv.

Yadav, V.; and Bethard, S. 2019. A survey on recent advances in named entity recognition from deep learning models. arXiv preprint arXiv:1910.11470.

Appendices

A NER Model Training Tagging Schema

BIO (Begin, Inside and Outside) and BIOES (Begin, Inside, Outside, End, Single) are schemes for encoding entity annotations as token tags. Words tagged with O are outside of named entities, and the I-XXX tag is used for words inside a named entity of type XXX. Whenever two entities of type XXX are immediately next to each other, the first word of the second entity is tagged B-XXX to highlight that it starts another entity. On the other hand, BIOES (also known as BILOU) is a slightly more sophisticated annotation method that distinguishes between the end of a named entity and single-token entities. BIOES stands for Begin, Inside, Outside, End, Single. In this scheme, for example, a word describing a gene entity is tagged with "B-Gene" if it is at the beginning of the entity, "I-Gene" if it is in the middle of the entity, and "E-Gene" if it is at the end of the entity. Single-word gene entities are tagged with "S-Gene". All other words not describing entities of interest are tagged as 'O'.
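As a small illustration of the two schemes (the sentence and the Gene label below are made up for this example), the same token sequence is tagged as follows:

tokens = ["tumor", "necrosis", "factor", "and", "TP53", "were", "measured"]

# BIO: the first token of an entity gets B-, the rest get I-, others get O.
bio   = ["B-Gene", "I-Gene", "I-Gene", "O", "B-Gene", "O", "O"]

# BIOES: the last token of a multi-token entity gets E-, and a
# single-token entity gets S- instead of B-.
bioes = ["B-Gene", "I-Gene", "E-Gene", "O", "S-Gene", "O", "O"]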
B Defining a Spark NLP Pipeline

from pyspark.ml import Pipeline
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Pre-trained clinical embeddings, NER, and assertion models.
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

clinical_assertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings"])\
    .setOutputCol("assertion")

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    clinical_assertion])

C Training an NER Model in Spark NLP

from pyspark.ml import Pipeline
import sparknlp
from sparknlp.training import CoNLL
from sparknlp.annotator import *

spark = sparknlp.start()

# CoNLL-formatted training data with document, sentence, token, and label columns.
training_data = CoNLL().readDataset(spark, 'BC5CDR_train.conll')

word_embedder = WordEmbeddings.pretrained('wikiner_6B_300', 'xx')\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

nerTagger = NerDLApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(10)\
    .setDropout(0.5)\
    .setLr(0.001)\
    .setPo(0.005)\
    .setBatchSize(8)\
    .setValidationSplit(0.2)

pipeline = Pipeline(stages=[
    word_embedder,
    nerTagger])

ner_model = pipeline.fit(training_data)
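A brief follow-up to Appendix C (not part of the original listing): the fitted pipeline is a standard Spark ML PipelineModel, so it can be persisted and reloaded with the usual Spark ML calls; the path below is illustrative.

from pyspark.ml import PipelineModel

# Persist the fitted NER pipeline and reload it later for inference.
ner_model.write().overwrite().save("ner_bc5cdr_pipeline")
restored = PipelineModel.load("ner_bc5cdr_pipeline")

# Applying the restored model adds the trained "ner" prediction column.
predictions = restored.transform(training_data)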
2015) cleic acid, nucleoside, or nucleotide, food, men- Entities: location, contact, date, profession, name, age, id tal process, prokaryote, nucleotide sequence, profes- sional or occupational group, cell, biologic function, ner deid enriched manufactured object, molecular biology research technique, (Stubbs et al. 2015) gene or genome, chemical, neoplastic process, phar- Entities: idnum, country, date, profession, medicalrecord, macologic substance, tissue, qualitative concept, username, organization, zip, id, healthplan, location, device, amino acid, peptide, or protein, fungus, popula- hospital, city, email, doctor, street, state, patient, bioid, url, tion group, body location or region, clinical attribute, phone, fax, age injury or poisoning, medical device, cell component, plant ner diseases ner posology (Doğan, Leaman, and Lu 2014) (Henry et al. 2020) Entities: disease Entities: form, dosage, strength, drug, route, frequency, duration ner drugs ner risk factors (Henry et al. 2020), (Segura Bedmar, Martı́nez, and Her- rero Zazo 2013) (Stubbs et al. 2015) Entities: drug Entities: family hist, smoker, obese, medication, hyper- tension, hyperlipidemia, phi, diabetes, cad ner events clinical ner human phenotype go clinical (Sun, Rumshisky, and Uzuner 2013) (Sousa, Lamurias, and Couto 2019) Entities: test, problem, clinical dept, occurrence, date, Entities: go, hp time, evidential, treatment, frequency, duration ner human phenotype gene clinical jsl ner wip clinical (Sousa, Lamurias, and Couto 2019) (in-house annotations from mtsamples and MIMIC-III (John- Entities: gene, hp son et al. 2016)) Entities: triglycerides, oncological, fe- ner chemprot clinical male reproductive status, form, time, date, alcohol, Entities: gene, chemical medical history header, race ethnicity, temperature, drug brandname, frequency, fetus newborn, sexu- ner ade clinical ally active or sexual orientation, disease syndrome disorder, section header, social history header, strength, cerebrovas- Entities: ade, drug cular disease, family history header, employment, weight, pregnancy, total cholesterol, diet, ekg findings, gender, ner chemicals drug ingredient, vaccine, substance, oxygen therapy, inter- Entities: chem nal organ or component, blood pressure, overweight, obe- sity, birth entity, heart disease, diabetes, substance quantity, ner bacterial species treatment, death entity, route, modifier, test, clinical dept, Entities: species