Identification of Cancer Entities in Clinical Text
Combining Transformers with Dictionary Features
John D. Osbornea , PhD, Tobias O’Learya , BSc, James Del Montea , Kuleen Sassea and
Wayne H. Lianga , MD MS
a
    University of Alabama at Birmingham, 720 2nd Ave South, Birmingham, AL 35294, USA

email: ozborn@uab.edu (J.D. Osborne); tobiasoleary@uab.edu (T. O’Leary); jvdelmon@uab.edu (J. Del Monte);
ksasse@uab.edu (K. Sasse); wliang@uabmc.edu (W.H. Liang)
url: https://github.com/ozborn/ (J.D. Osborne)
orcid: 0000-0002-0851-1150 (J.D. Osborne); 0000-0001-7116-9338 (T. O’Leary); 0000-0003-2354-9787 (W.H. Liang)

Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), September 2020, Málaga, Spain.
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.


Abstract
Clinical NLP tools that automatically extract cancer concepts from unstructured Electronic Health Record
(EHR) text can benefit cancer treatment matching, clinical trial cohort identification, and reportable cancer
abstraction. We used a combination of two BERT-based [1] language models, BETO [2] and MBERT [1],
with regular expressions constructed from training data and ICD-O dictionary-based features to participate
in the tumor named-entity recognition subtask of the 2020 CANTEMIST (CANcer TExt Mining Shared
Task) [3]. Our goal is to explore the incorporation of dictionary-based features into these models to provide
better integration between machine learning models and external knowledge resources. Results on the test
data set were highest with a regular expression-based system (F-Score 0.73), and development set results
showed a 5 point drop in F-Score (0.76 to 0.71) when integrating dictionary features into our BETO-based
system. We suggest that dictionary-based features will need careful integration to improve the performance
of masked language models.

Keywords
clinical concept recognition, NLP, named entity recognition, information extraction, cancer




1. Introduction
The widespread adoption of Electronic Health Records (EHR) has resulted in an explosion in the
volume of clinical data captured electronically. Computerized methods such as data analytics
and clinical decision support (CDS) can be applied on clinical data to accelerate new scientific
discoveries and improve clinical care delivery. This is particularly important in the field of
oncology: cancer treatments are highly specific to cancer subtypes based upon clinical and
tumor attributes (e.g., Philadelphia chromosome-positive B-cell Acute Lymphoblastic Leukemia);
cancer clinical trials require identification of patients who meet highly specific eligibility criteria
(e.g., cancer subtype, clinical features, biomarker status); and cancer reporting for public health
surveillance and quality assurance requires abstracting detailed cancer-related attributes from
the clinical record. Each of the above examples would highly benefit from automated extraction
of cancer concepts from the EHR[4]. However, much of the rich phenotype data required
for these examples is found not in machine-readable structured data, but
solely in unstructured texts (e.g., clinic notes, pathology reports, radiology reports)[5]. Natural
Language Processing (NLP) tools that can automatically and accurately extract cancer concepts
from unstructured clinical texts can increase the spectrum of data available for computational
methods, thereby benefiting cancer research and care delivery. In this report, we compare
multiple approaches to extracting cancer entities from CANTEMIST[3], a Spanish language
clinical text data set.

1.1. Background
The first task of structured data extraction is the identification of the specific span of text
(mention) containing the name of interest. This is referred to as Named-Entity Recognition
(NER), or clinical entity recognition in the context of clinical text. NER software that has
been developed specifically for clinical text includes cTAKES[6], CliNER[7] and other machine
learning approaches utilizing support vector machines[8] and conditional random fields[9].
More recent methods have applied neural networks to clinical NER [10, 11, 12], including Deep
Learning (DL) methods[13]. In particular, the development of the transformer architecture[14]
and masked language models like BERT[1] and its siblings has yielded impressive results on non-
clinical benchmarks like SuperGLUE[15]. Subsequently, a variety of English clinical language
embeddings[16, 17, 18] have been developed, as well as non-clinical multilingual models such
as MBERT[1] and language-specific models such as BETO[2].
   Relatively little attention has been given to integrating dictionary features for large clini-
cal vocabularies into these types of architectures for clinical NER. One recent exception[19]
incorporated dictionaries into a Bi-LSTM-CRF DL model by integrating feature vectors with
character embeddings, obtaining good results for Chinese clinical NER. Incorporating
dictionaries into DL models could combine the higher performance of DL models with
the user control and understanding provided by dictionaries. For example, dictionary
integration could allow for easier incorporation of vocabulary updates, such as changes to
the International Classification of Diseases for Oncology (ICD-O) codes, or changes to cancer
reporting requirements for tools like the Cancer Registry Control Panel (CRCP)[20].
   For this paper, we explore the integration of transformer-based language models (such as
BETO and MBERT) with external knowledge resources, as well as their applicability to clinical
entity normalization for cancer concepts.


2. Methodology
We developed a total of 8 different systems for this task, including BETO-FLAIR, REGEX, BETO-
FLAIR-REGEX, MBERT-REGEX, MBERT-PYTORCH, MBERT-DICT, BETO-PYTORCH and BETO-
DICT. Only 3 systems, BETO-FLAIR, MBERT-PYTORCH and REGEX, were finished in time to be
official entries for the CANTEMIST shared task, but results are shown for all systems. All systems
with masked language models were constructed utilizing Huggingface’s implementations of
transformer-based language models[21], specifically BETO[2] and MBERT[1], the multi-lingual
extension of BERT. Specific details of the language models are in their own sections below.
   We used three distinct methods for predicting annotations: fine-tuned masked language
models (BETO-FLAIR, BETO-PYTORCH and MBERT-PYTORCH), regular expressions alone
or in conjunction with masked language models (REGEX, BETO-FLAIR-REGEX and MBERT-
REGEX), and dictionary features in conjunction with masked language models (MBERT-DICT
and BETO-DICT). BETO is a BERT language model pretrained on a large Spanish corpus and
outperforms MBERT on several Spanish tasks, including natural language inference, paraphrasing,
NER, and document classification[2]. Multilingual BERT (MBERT) is a cross-lingual extension
of BERT, trained on 104 languages (https://github.com/google-research/bert/blob/
master/multilingual.md#list-of-languages). Like BERT-base and BETO, MBERT has
110 million parameters. MBERT was pretrained on Wikipedia text from the 104 languages
with the largest number of articles, and features a wordpiece vocabulary of 119,000 tokens
shared across all languages. Regular expressions were included to create a baseline for
comparison.

2.1. Data
The Cantemist NER subtask is an information extraction task to identify tumor morphology
mentions in a Spanish language corpus of synthetic oncological clinical case reports. The corpus
was annotated with a single class, "MORFOLOGIA_NEOPLASIA," in the BRAT[22] standoff
format. The corpus was split into 4 sets: train-set, dev-set1, dev-set2 and test-set, and included
an unannotated background set that was not utilized. Table 1 contains the overall size of these
sets. Also available was a list of ICD-O-3 codes (valid_codes.txt) containing the morphology
codes, each with an associated term and comment. This was used for the dictionary-based
systems; no other third-party data were used to develop our systems.

Table 1
Cantemist Data Set: Number of Reports, Sentences, and Tokens in each annotated data set.

                            Data Set    Reports   Sentences   Tokens
                            train-set       501       18540   456447
                            dev-set1        250        9092   226371
                            dev-set2        250        8332   183356
                            test-set        300       10727   248770

2.2. Pre-Processing
Cantemist input files in BRAT standoff format (.ann files) were converted to CoNLL format for
processing by all systems, except REGEX. Input text was tokenized using the NLTK Spanish
tokenizer and converted to either a 2 or 3 column format. The first column specified the token,
and the second column specified the Cantemist tag in IOB (Inside-Outside-Begin) format. An
optional third column was used in BETO-FLAIR-REGEX and MBERT-REGEX to specify the logits
resulting from the pytorch-based MBERT-PYTORCH system, which were used to adjust the cutoff
frequency for the regular expression component (REGEX). During the conversion, many of
the annotations overlapped, which affected the performance of the converter. To handle
overlapping annotations, we kept only the longest of the overlapping annotations, based on
span length.
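
The conversion can be sketched as follows. This is a simplified illustration: brat_to_conll is a
hypothetical helper, and discontinuous BRAT offsets and tokenizer-altered tokens are not handled.

    import nltk  # requires the 'punkt' tokenizer models: nltk.download('punkt')

    def brat_to_conll(text, ann_lines):
        """Convert one report (raw text plus BRAT .ann lines) to (token, IOB tag) rows."""
        # Collect (start, end) offsets of MORFOLOGIA_NEOPLASIA annotations
        spans = []
        for line in ann_lines:
            if line.startswith('T'):
                _, type_and_offsets, _ = line.split('\t')
                _, start, end = type_and_offsets.split()[:3]
                spans.append((int(start), int(end)))
        # Keep only the longest of any overlapping annotations
        spans.sort(key=lambda s: s[1] - s[0], reverse=True)
        kept = []
        for s in spans:
            if all(s[1] <= k[0] or s[0] >= k[1] for k in kept):
                kept.append(s)
        rows, cursor = [], 0
        for sentence in nltk.sent_tokenize(text, language='spanish'):
            for token in nltk.word_tokenize(sentence, language='spanish'):
                start = text.find(token, cursor)  # naive character-offset recovery
                if start < 0:
                    continue
                cursor = start + len(token)
                tag = 'O'
                for (s, e) in kept:
                    if s <= start and cursor <= e:
                        tag = ('B-' if start == s else 'I-') + 'MORFOLOGIA_NEOPLASIA'
                rows.append((token, tag))
            rows.append(('', ''))  # blank line marks the sentence boundary
        return rows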

2.3. BETO-FLAIR
For our first system, we used the sequence tagger from Flair version 0.4.2. We loaded the
pretrained BETO[2] cased model as the base model for our sequence tagger. BETO is a BERT
model pretrained on a Spanish corpus of approximately 3 billion words using an architecture
similar to BERT-base; both have 110 million parameters. However, BETO has 32,000 words in
its vocabulary, compared with 30,000 in BERT-base. We trained the model for 20 epochs with a
batch size of 32 using train-set as the training data. We validated on dev-set1 and tested on
dev-set2. This language-specific masked language model system was used as a reference point
for an "off the shelf" clinical NER implementation.
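
A minimal sketch of this setup is shown below. We used Flair 0.4.2; the class names and the
BETO checkpoint identifier here follow more recent Flair and Hugging Face releases and are
assumptions rather than our exact code.

    from flair.datasets import ColumnCorpus
    from flair.embeddings import TransformerWordEmbeddings
    from flair.models import SequenceTagger
    from flair.trainers import ModelTrainer

    # CoNLL-style files from our pre-processing (column 0: token, column 1: IOB tag)
    corpus = ColumnCorpus('data/', {0: 'text', 1: 'ner'},
                          train_file='train.conll', dev_file='dev1.conll',
                          test_file='dev2.conll')
    tag_dictionary = corpus.make_label_dictionary(label_type='ner')

    # Cased BETO via its Hugging Face checkpoint
    embeddings = TransformerWordEmbeddings('dccuchile/bert-base-spanish-wwm-cased')
    tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,
                            tag_dictionary=tag_dictionary, tag_type='ner')
    ModelTrainer(tagger, corpus).train('models/beto-flair',
                                       mini_batch_size=32, max_epochs=20)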

2.4. REGEX
We constructed a regular expression by joining together a unique list of each annotated mention
of cancer in the training and development sets. When evaluating against dev-set2, we excluded
annotations from that set. We removed a single two-letter string, ’Ca’, because it generated more
false positives than true positives. After replacing each regex escape character with a pattern
matching any non-Spanish letter, we allowed the first character of a string to match both its
upper and lowercase forms if the string was longer than 5 characters. This boundary was chosen
by manually reviewing the output. We listed these expressions from longest to shortest and
concatenated them together with the regex ’or’ operator, enforcing a word boundary before
and after. See Table 3 for a clarifying example. Any regex match was predicted as
"MORFOLOGIA_NEOPLASIA".



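The construction can be sketched as follows (build_regex is a hypothetical helper, and the
escape-character handling is simplified relative to our system):

    import re

    def build_regex(mentions):
        """Join unique annotated mentions into one alternation, longest first."""
        patterns = []
        for m in sorted(set(mentions) - {'Ca'}, key=len, reverse=True):
            first, rest = m[0], re.sub(r'\s+', ' ', m[1:])
            # Escape the tail, then let whitespace runs match flexibly
            rest = re.escape(rest).replace('\\ ', '\\s+')
            if len(m) > 5 and first.isalpha():
                # Allow the first letter in either case
                head = '[%s%s]' % (first.upper(), first.lower())
            else:
                head = re.escape(first)
            patterns.append(head + rest)
        # Word-boundary-like anchors before and after, as in Table 3
        return re.compile(r'(?:\W)(' + '|'.join(patterns) + r')(?=\W)')

Here, mentions would be the unique annotated strings from the training and development sets
(excluding dev-set2 annotations when evaluating on that set).
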
2.5. MBERT-PYTORCH and BETO-PYTORCH
Both implementations use Huggingface’s Transformers[21] library to provide the MBERT-
PYTORCH model (using the pretrained ’bert-base-multilingual-cased’ model) or the cased
BETO model (BETO-PYTORCH). Data from CoNLL files are packed into samples as close as
possible to BERT’s maximum of 512 tokens. The assigned labels consist of IOB tags, indicating
whether a given subword is the beginning ("B") of a match, the interior ("I") of a match, or
outside ("O") any match. The model functions as a standard pytorch model and is trained for 4
epochs. The batch size was 8; however, the implementation uses gradient accumulation,
applying parameter updates only after a specified number of training steps, effectively giving
the model a batch size of 32 to match the BERT paper. Since Huggingface provides tools only for
subword (token) classification and sample classification, we used the label generated for the first
subword in a word as the label for that word for both MBERT-PYTORCH and BETO-PYTORCH.
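
The first-subword labeling can be sketched as follows with a recent version of the Transformers
API (a simplified illustration; padding, truncation to 512 tokens, and batching are omitted):

    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    LABELS = {'O': 0, 'B-MORFOLOGIA_NEOPLASIA': 1, 'I-MORFOLOGIA_NEOPLASIA': 2}
    IGNORE = -100  # ignore_index of the cross-entropy loss

    tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
    model = AutoModelForTokenClassification.from_pretrained(
        'bert-base-multilingual-cased', num_labels=len(LABELS))

    def encode(words, tags):
        """Label only the first subword of each word; mask the remaining subwords."""
        input_ids, label_ids = [tokenizer.cls_token_id], [IGNORE]
        for word, tag in zip(words, tags):
            subwords = tokenizer.encode(word, add_special_tokens=False)
            input_ids += subwords
            label_ids += [LABELS[tag]] + [IGNORE] * (len(subwords) - 1)
        input_ids.append(tokenizer.sep_token_id)
        label_ids.append(IGNORE)
        return torch.tensor([input_ids]), torch.tensor([label_ids])

    # One training step; dividing the loss over 4 accumulation steps turns the
    # batch size of 8 into an effective batch size of 32
    ids, labels = encode(['adenocarcinoma', 'mucinoso'],
                         ['B-MORFOLOGIA_NEOPLASIA', 'I-MORFOLOGIA_NEOPLASIA'])
    loss = model(input_ids=ids, labels=labels).loss / 4
    loss.backward()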

2.6. BETO-FLAIR-REGEX and MBERT-REGEX
The MBERT-PYTORCH and BETO-FLAIR systems were extended by integration with the REGEX
system. Both systems were modified to output a confidence score, ranging from 0 to 1, as-
sociated with the prediction for each token. We obtain these scores by applying softmax to
the logit values returned by the BERT model. For tokens initially classified as not
"MORFOLOGIA_NEOPLASIA" (the "O" class) but contained in a span of text that the
REGEX system classified as "MORFOLOGIA_NEOPLASIA," we adjusted the confidence
score by +0.15. If the adjusted confidence score of any of the tokens crossed the 0.5 boundary,
we changed the classification to "MORFOLOGIA_NEOPLASIA" for all tokens matched in the
regular expression. The confidence score adjustment of +0.15 was chosen empirically after
manually reviewing the output of the two systems. The decision to apply the adjustment solely
to the "O" class was made in hopes of improving recall without reducing precision. This resulted
in only a few adjustments to the predicted results of the two systems: BETO-FLAIR-REGEX
resulted in 1998 annotation changes compared with BETO-FLAIR alone across both the test and
background sets, and MBERT-REGEX resulted in 59 changes compared with MBERT-PYTORCH
alone.
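
A sketch of this adjustment is given below, assuming tokens are represented as dicts carrying
character offsets, the predicted label, and the softmax probability of the entity class (this
representation is ours, not the system’s internal one):

    def apply_regex_boost(text, tokens, pattern, boost=0.15, threshold=0.5):
        """tokens: list of dicts with keys 'start', 'end', 'label', 'prob',
        where 'prob' is the softmax probability of MORFOLOGIA_NEOPLASIA."""
        for match in pattern.finditer(text):
            inside = [t for t in tokens
                      if t['start'] >= match.start(1) and t['end'] <= match.end(1)]
            # Boost only tokens the model left outside the entity class ('O')
            crossed = any(t['label'] == 'O' and t['prob'] + boost > threshold
                          for t in inside)
            if crossed:
                # Relabel every token covered by the regex match
                for i, t in enumerate(inside):
                    t['label'] = ('B-' if i == 0 else 'I-') + 'MORFOLOGIA_NEOPLASIA'
        return tokens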

2.7. MBERT-DICT and BETO-DICT
The MBERT-DICT and BETO-DICT systems extend MBERT-PYTORCH and BETO-PYTORCH,
respectively, with the dictionary-based features described in the next section. We wrote a
custom head for the BertForTokenClassification model which concatenates these dictionary-
based features with the logits corresponding to each subword in a sample. These extended
samples run through two stacks of dropout and fully-connected layers, first mapping to
HIDDEN_SIZE, then applying the standard mapping to NUM_LABELS. Results for this model
converge in 4 epochs.
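
The head can be sketched as follows (the HIDDEN_SIZE value and dropout rate shown are
illustrative; the feature count matches Table 4):

    import torch
    import torch.nn as nn

    NUM_LABELS, NUM_DICT_FEATURES, HIDDEN_SIZE = 3, 8, 128  # 8 features (Table 4)

    class DictionaryHead(nn.Module):
        """Concatenates per-subword dictionary features with the per-subword
        logits from BertForTokenClassification and maps back to the labels."""
        def __init__(self, dropout=0.1):
            super().__init__()
            self.stack = nn.Sequential(
                nn.Dropout(dropout),
                nn.Linear(NUM_LABELS + NUM_DICT_FEATURES, HIDDEN_SIZE),
                nn.Dropout(dropout),
                nn.Linear(HIDDEN_SIZE, NUM_LABELS))

        def forward(self, logits, dict_features):
            # logits: (batch, seq_len, NUM_LABELS)
            # dict_features: (batch, seq_len, NUM_DICT_FEATURES), one value per feature
            return self.stack(torch.cat([logits, dict_features], dim=-1))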




2.8. Dictionary Features
Table 4 summarizes dictionary-based features used in the BETO-DICT and MBERT-DICT systems.
Features were selected to assess both the coverage and the cohesiveness of input mentions
relative to a term representation, meaning all term names and synonyms for an entry in the
Cantemist-provided ICD-O dictionary (the valid_codes.txt file).
   For subword-based dictionary features, values were calculated from the input subwords
present in the mention. We utilized the entire 512-subword BETO limit as our lookup
window.
   Character-based dictionary features were calculated using a Python string similarity library
(https://github.com/luozhouyang/python-string-similarity). Character-based comparisons were
made between a character-based term representation of the dictionary term and the lookup
window. Parameters for the character-based features, including the overlap coefficient shingle
size and the N-gram shingle size, were determined empirically as 3 and 5, respectively. All dictionary
features are described in detail below.

Highest subword coverage For each ICD-O entry, the number of subwords shared between
the subword window and the entry’s term representation (its dictionary name and any of its
synonyms) gives a subword overlap count. The highest subword coverage is the maximum such
count over the entire dictionary.

Distinct subword Each subword in the subword window is matched against all dictionary
term representations, yielding for each subword the set of entries containing it. The distinct
subword is the lowest cardinality among these sets, i.e., the number of entries containing the
rarest subword in the window (listed as "Lowest subword specificity" in Table 4).

Average matches The average number of subwords shared between the subword window and
a term representation, computed over the entire dictionary.

Fraction of entries with highest subword coverage The fraction of entries whose term
representation attains the maximum number of subword matches with the subword window.

Differential subword coverage The differential subword coverage is computed by taking
the "Highest subword coverage" described above and subtracting the average subword overlap
between the subword window and all subword term representations in the ICD-O dictionary.

Best entry log subword frequencies The subword frequency for each subword in the ICD-
O dictionary is computed for all subwords overlapping between the subword window and the
subword term representation. The log of these frequencies is summed for each comparison and
the highest value is used as the "best entry log subword frequencies".
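
For concreteness, several of the subword features can be sketched as below, representing the
lookup window and each term representation as sets of subword ids (a simplification of our
implementation; the function name is ours):

    from collections import Counter

    def subword_features(window, dictionary):
        """window: set of subword ids in the lookup window;
        dictionary: entry id -> set of subword ids in its term representation."""
        overlaps = {entry: len(window & subs) for entry, subs in dictionary.items()}
        highest = max(overlaps.values(), default=0)           # highest subword coverage
        average = sum(overlaps.values()) / len(dictionary)    # average matches
        fraction = sum(1 for v in overlaps.values() if v == highest) / len(dictionary)
        differential = highest - average                      # differential coverage
        # Distinct subword: how many entries contain the rarest window subword
        entry_count = Counter(s for subs in dictionary.values() for s in subs)
        distinct = min((entry_count[s] for s in window if s in entry_count), default=0)
        return highest, average, fraction, differential, distinct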




Table 4
Dictionary Features for Named Entity Normalization

                                  Feature Name                   Feature Type
                            Highest subword coverage               Subword
                            Lowest subword specificity             Subword
                            Average matches                        Subword
                            Fraction of entries with
                            highest subword coverage               Subword
                            Differential subword coverage          Subword
                            Best entry log subword frequencies     Subword
                            Overlap coefficient                    Character
                            N-Gram                                 Character


Table 5
Official Test Results
                               System            Prec   Recall    F-Score
                            BETO-FLAIR         0.736    0.609     0.667
                          MBERT-PYTORCH        0.673    0.357     0.467
                              REGEX            0.688    0.744     0.715


Overlap coefficient A Szymkiewicz-Simpson overlap coefficient (with shingle size 3) is
calculated between the character-extended subword window (the mention subword plus 10
adjacent characters on either side) and a character-based term representation.

Shingle N-Grams Character 5-gram profiles were pre-computed for each character-based
term representation and compared to a profile from the input mention’s character-extended
subword window using cosine similarity.
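
These character-based features can be sketched in plain Python as follows; we actually used the
python-string-similarity library, so these re-implementations are only illustrative:

    import math
    from collections import Counter

    def shingles(s, k):
        """Character k-gram (shingle) profile of a string."""
        return Counter(s[i:i + k] for i in range(len(s) - k + 1))

    def overlap_coefficient(a, b, k=3):
        """Szymkiewicz-Simpson overlap of character k-gram sets (shingle size 3)."""
        sa, sb = set(shingles(a, k)), set(shingles(b, k))
        return len(sa & sb) / min(len(sa), len(sb)) if sa and sb else 0.0

    def ngram_cosine(a, b, k=5):
        """Cosine similarity between character 5-gram profiles."""
        pa, pb = shingles(a, k), shingles(b, k)
        dot = sum(pa[g] * pb[g] for g in pa)
        na = math.sqrt(sum(v * v for v in pa.values()))
        nb = math.sqrt(sum(v * v for v in pb.values()))
        return dot / (na * nb) if na and nb else 0.0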


3. Results
Official test results are shown in Table 5, and unofficial test results including all systems
developed are shown in Table 6. Only 3 systems were submitted in time for official evaluation,
but we show test results for all systems using the provided evaluation script. The discrepancies
between the official and unofficial test results are caused by changes made to the utility that
converted .conll and .ann files, changes in the regex system to more loosely match escape
characters, and processing all files with MBERT-PYTORCH (our original submission included
only 80% of files). Our best performing system for the official test results was the baseline
REGEX system, which reflects underdevelopment of the masked language model systems,
although BETO-FLAIR had the highest precision. This is replicated in the unofficial results.
Additionally, we show results on the development set in Table 7. The BETO-PYTORCH system
obtained the best results in all 3 evaluation metrics.




Table 6
Updated Unofficial Test Results
                                  System        Prec   Recall   F-Score
                             BETO-FLAIR         0.74   0.61     0.67
                           BETO-PYTORCH         0.71   0.68     0.70
                           MBERT-PYTORCH        0.69   0.30     0.42
                               REGEX            0.70   0.77     0.73
                          BETO-FLAIR-REGEX      0.67   0.64     0.66
                            MBERT-REGEX         0.63   0.68     0.66
                             BETO-DICT          0.61   0.71     0.66
                             MBERT-DICT         0.64   0.67     0.65


Table 7
Development Data Set Results
                                  System        Prec   Recall   F-Score
                             BETO-FLAIR         0.68   0.61     0.64
                           BETO-PYTORCH         0.76   0.76     0.76
                           MBERT-PYTORCH        0.69   0.73     0.71
                               REGEX            0.67   0.74     0.70
                          BETO-FLAIR-REGEX      0.68   0.61     0.64
                            MBERT-REGEX         0.69   0.73     0.71
                             BETO-DICT          0.67   0.58     0.62
                             MBERT-DICT         0.67   0.74     0.70


4. Discussion
We were disappointed by the poor performance of pure transformer-based systems on the test
data relative to the simple regular expression-based system REGEX. We also tested using a pure
regular expression dictionary matching approach, but performance was worse than simply look-
ing for exact matches in the training data (data not shown). However, on the development data
BETO-PYTORCH did produce the best results. Our efforts to integrate dictionary features with
masked language models also yielded disappointing results. We suspect the poor performance
of the dictionary-based approach is due to limited development time, the reliance on subwords
(versus words), an overly large lookup window (512 subwords instead of a sentence or smaller
window), and a lack of dictionary feature validation and testing, rather than the integration of
these features into the BERT model itself.
   Combining BERT-based models with REGEX did not result in a significant improvement.
Recall was slightly higher when evaluating on the test data set, but at a cost of lower precision.
High recall with lower precision is naturally expected when using regular expression-based
systems for NER. Picking a single confidence score adjustment and applying that across multiple
trained models also likely caused lower performance, since each model’s average confidence
score for a given class was significantly different.




Future Directions In the short term, we plan to expand the number of dictionary-based
features to better account for term variation and head nouns. Word level features also need to be
introduced, and the utility of subwords to handle medical abbreviations and relevant Latin and
Greek roots needs to be evaluated. Appropriate medical stemming, or use of a clinical subword
vocabulary, also needs to be evaluated. We are also interested in cross-language evaluation
(English-Spanish) of cancer extraction terms.

4.1. Limitations
Our work suffered from a number of limitations, the most important being the lack of a Spanish
speaker in our group, forcing us to rely on Google Translate and the similarity of Latin-based
medical terms. Due to time constraints, we did not perform a principled dictionary feature
evaluation to assess the relative importance of features. Parameter settings were not fully
evaluated for similar reasons.


Acknowledgments
This publication was supported by internal funding from the Informatics Institute at the
University of Alabama at Birmingham, and an Nvidia™ grant of a Titan XP GPU used for
machine learning.


References
 [1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
 [2] J. Cañete, G. Chaperon, R. Fuentes, J. Pérez, Spanish pre-trained bert model and evaluation
     data, in: PML4DC at ICLR 2020, 2020.
 [3] A. Miranda-Escalada, E. Farré, M. Krallinger, Named entity recognition, concept normal-
     ization and clinical coding: Overview of the cantemist track for cancer text mining in
     spanish, corpus, guidelines, methods and results, in: Proceedings of the Iberian Languages
     Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings, 2020.
 [4] G. K. Savova, I. Danciu, F. Alamudun, T. Miller, C. Lin, D. S. Bitterman, G. Tourassi, J. L.
     Warner, Use of natural language processing to extract clinical cancer phenotypes from
     electronic medical records, Cancer Research 79 (2019) 5463–5470.
 [5] S. T. Rosenbloom, J. C. Denny, H. Xu, N. Lorenzi, W. W. Stead, K. B. Johnson, Data from
     clinical notes: a perspective on the tension between structure and flexible documentation,
     Journal of the American Medical Informatics Association 18 (2011) 181–186.
 [6] G. K. Savova, J. J. Masanz, P. V. Ogren, J. Zheng, S. Sohn, K. C. Kipper-Schuler, C. G.
     Chute, Mayo clinical text analysis and knowledge extraction system (ctakes): architecture,
     component evaluation and applications, Journal of the American Medical Informatics
     Association 17 (2010) 507–513.
 [7] W. Boag, K. Wacome, T. Naumann, A. Rumshisky, Cliner: a lightweight tool for clinical
     named entity recognition, AMIA Joint Summits on Clinical Research Informatics (poster)
     (2015).




 [8] B. Tang, H. Cao, Y. Wu, M. Jiang, H. Xu, Clinical entity recognition using structural
     support vector machines with rich features, in: Proceedings of the ACM Sixth International
     Workshop on Data and Text Mining in Biomedical Informatics, 2012, pp. 13–20.
 [9] Y. Zhang, X. Wang, Z. Hou, J. Li, Clinical named entity recognition from chinese electronic
     health records via machine learning methods, JMIR Medical Informatics 6 (2018) e50.
[10] Z. Liu, M. Yang, X. Wang, Q. Chen, B. Tang, Z. Wang, H. Xu, Entity recognition from
     clinical texts via recurrent neural network, BMC Medical Informatics and Decision Making
     17 (2017) 67.
[11] F. Dernoncourt, J. Y. Lee, P. Szolovits, Neuroner: an easy-to-use program for named-entity
     recognition based on neural networks, arXiv preprint arXiv:1705.05487 (2017).
[12] F. Dernoncourt, J. Y. Lee, O. Uzuner, P. Szolovits, De-identification of patient notes with
     recurrent neural networks, Journal of the American Medical Informatics Association 24
     (2017) 596–606.
[13] Y. Wu, M. Jiang, J. Xu, D. Zhi, H. Xu, Clinical named entity recognition using deep learning
     models, in: AMIA Annual Symposium Proceedings, volume 2017, American Medical
     Informatics Association, 2017, p. 1812.
[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polo-
     sukhin, Attention is all you need, in: Advances in Neural Information Processing Systems,
     2017, pp. 5998–6008.
[15] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman,
     Superglue: A stickier benchmark for general-purpose language understanding systems,
     in: Advances in Neural Information Processing Systems, 2019, pp. 3266–3280.
[16] E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, M. McDermott,
     Publicly available clinical bert embeddings, arXiv preprint arXiv:1904.03323 (2019).
[17] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, Biobert: a pre-trained biomedical
     language representation model for biomedical text mining, Bioinformatics 36 (2020)
     1234–1240.
[18] K. Huang, J. Altosaar, R. Ranganath, Clinicalbert: Modeling clinical notes and predicting
     hospital readmission, arXiv preprint arXiv:1904.05342 (2019).
[19] Q. Wang, Y. Zhou, T. Ruan, D. Gao, Y. Xia, P. He, Incorporating dictionaries into deep
     neural networks for the chinese clinical named entity recognition, Journal of Biomedical
     Informatics 92 (2019) 103133.
[20] J. D. Osborne, M. Wyatt, A. O. Westfall, J. Willig, S. Bethard, G. Gordon, Efficient identifi-
     cation of nationally mandated reportable cancer cases using natural language processing
     and machine learning, Journal of the American Medical Informatics Association (2016)
     ocw006.
[21] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf,
     M. Funtowicz, et al., Huggingface's transformers: State-of-the-art natural language pro-
     cessing, arXiv preprint arXiv:1910.03771 (2019).
[22] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, J. Tsujii, BRAT: a web-based
     tool for NLP-assisted text annotation, in: Proceedings of the Demonstrations at the 13th
     Conference of the European Chapter of the Association for Computational Linguistics,
     Association for Computational Linguistics, 2012, pp. 102–107.



