Tick Parasitism Classification from Noisy Medical Records

James O'Neill^1*, Danushka Bollegala^1, Alan D. Radford^2 and PJ Noble^3
1 Department of Computer Science, University of Liverpool
2 Department of Infection Biology, University of Liverpool
3 Small Animal Department, University of Liverpool
{danushka.bollegala, alanrad, rtnortle}@liverpool.co.uk
* james.o-neill@liverpool.ac.uk

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Much of the health information in the medical domain comes in the form of clinical narratives. The rich semantic information contained in these notes can be modeled to make inferences that assist the decision-making process for medical practitioners, which is particularly important under time and resource constraints. However, the creation of such assistive tools is made difficult by the ubiquity of misspellings, unsegmented words and morphologically complex or rare medical terms. This reduces the coverage of vocabulary terms present in commonly used pretrained distributed word representations that are passed as input to the parametric models that make such predictions. This paper presents an ensemble architecture that combines in-domain and general word embeddings to overcome these challenges, showing the best performance on a binary classification task when compared to various other baselines. We demonstrate our approach in the veterinary domain on the task of identifying tick parasitism in small animals. The best model shows 84.29% test accuracy, an improvement over models that only use pretrained embeddings not specifically trained for the medical sub-domain of interest.

1 Introduction

Clinical narratives contain important and useful information about the health of a subject. Medical practitioners often have to spend a considerable amount of time reading these notes to make informed decisions, which can be quite laborious. Parametric models can be used to extract information that assists medical experts in decision-making while reducing this burden. However, spelling mistakes, complex medical terms and rare terms are ubiquitous in such clinical narratives [Roberts et al., 2018].

Tick parasitism (TP) is commonly seen in veterinary patients. Given that ticks can transmit a variety of diseases, including important zoonotic diseases (e.g. Lyme disease), tools to screen clinical records for reports of tick parasitism would be valuable for the surveillance of tick activity and subsequent disease, and for developing clinical decision support for clinicians.

The aim of this work is to automate the annotation of clinical notes from small animal practice for the presence of TP. We are motivated by the fact that using veterinary notes allows us to keep the privacy of the small animals intact while improving the quality of assistive diagnosis and, in turn, medication. This is something that is not easily achieved outside of veterinary practice.

We are able to take advantage of small animal clinical records collected through the Small Animal Veterinary Surveillance Network (SAVSNET). Narratives in the SAVSNET corpus are currently screened using simple regular expressions to identify mentions of the word 'tick'. These mentions may refer to a tick present on a pet or simply to discussion of tick prevention, and need to be manually annotated accordingly.

We propose a dynamic ensemble neural network that learns to classify TP from imprecise clinical narratives. Our approach incorporates fine-tuned in-domain word embeddings and domain-agnostic pretrained embeddings to improve classification performance. A learned parameter weights the combination of the embeddings in the overall classification.

Contributions. Our contributions are as follows:

1. A novel application of text classification to noisy clinical narratives, specifically in the veterinary domain.
2. To the best of our knowledge, the first known attempt to predict TP in animals from textual descriptions.
3. An ensemble approach that combines domain-agnostic and domain-specific representations (n-gram character, subword and word vectors) in a recurrent neural network architecture, compared against strong baselines of both non-neural network and neural network classifiers.
2 Background

2.1 Tick-borne Disease

Tick-borne diseases (TBD) are caused by a variety of pathogens (bacteria, viruses, rickettsia and protozoa) which are transmitted through tick bites. Identifying TBD and preventing them from spreading is difficult given that ticks have a wide geographic range and are highly adaptive to changing environments, allowing this range to increase. Globally, significant TBD include tularaemia and Rocky Mountain spotted fever. In a recent study using electronic health records from cats and dogs, TP was seen most commonly in the south-central region of England, with peak activity in summer and a smaller peak in cats in autumn [Tulloch et al., 2017].

2.2 Economic Impact of Tick Parasitism

A recent study from Germany highlighted a potential cost of more than 30M Euros resulting from Lyme borreliosis alone [Lohr et al., 2015]. In addition, recent work reviewed the impact of TP on production animals that are required to meet standard health conditions [Giraldo-Ríos and Betancur, 2018]: it reduces the survival rate of the animal and the production of meat, milk, eggs etc., not to mention the costs incurred for treatment.

2.3 Small Animal Ticks in the UK

In this work, we focus on small animals within the United Kingdom (UK). In the past decade, there has been an ongoing effort to collect ticks by the public, veterinary health agencies and practitioners within the UK as part of the Tick Surveillance Scheme and The Big Tick Project [Jameson and Medlock, 2011; Abdullah et al., 2016], in an effort to identify various tick species (predominantly from companion animals) and their locality across various regions of the UK. Although these projects demonstrate the viability of nationwide surveillance programs to monitor tick species, we argue that the requirement for active manual participation by contributors and the lack of automation, along with inconsistent or aperiodic data collection, present barriers to participation and to the collation of representative data in the long term. Using an automated system to screen clinical notes for tick parasitism will enhance tick surveillance. Furthermore, this work will provide a model for systems that might be used for clinical note summarisation and, subsequently, for clinical decision-making support.

3 Related Research

Most previous work on medical text classification has focused on cleaner text extracted from more formal registers. However, there has been recent work exploring classification on health records and notes, which we include below.

A key challenge is making use of information-rich notes while reducing the redundancy contained in the corpus due to the copying of notes, which can lead to a degradation in performance. In the context of topic modelling, prior work has applied a variant of Latent Dirichlet Allocation (LDA) to patient record notes [Cohen et al., 2014], which they refer to as Red-LDA. Red-LDA removes this redundancy and improves on topic coherence and qualitative assessments in comparison to standard LDA.

Yi and Beheshti used a hidden Markov model for classifying medical documents that incorporates prior knowledge in the form of medical subject headings [Yi and Beheshti, 2009]. Iyer et al. performed text mining on clinical text for drug-event recognition from 50 million clinical notes to create a timeline of adverse drug event mentions per patient [Iyer et al., 2013]. This was used to identify drug-drug-event associations for 1,165 drugs and 14 events.

Other methods that do not use text have relied on geospatial data for modelling tick presence [Swart et al., 2014]. The model predicted the presence of ticks within a 1 km² grid from field data using satellite-based methodology with Bayesian priors chosen over landscape types. Tick presence was estimated for 54% of the land cover, with ticks found at 37% of the 677 coordinates sampled.

Lastly, recent work has performed medical text classification with convolutional neural networks (CNN) using Word2Vec as input [Hughes et al., 2017]. This was shown to outperform logistic regression using Doc2Vec representations or bag-of-words (BOW) based Word2Vec approaches. They use k-means clustering (k=1000) to generate a feature vocabulary, which is then used to build a soft-assignment BOW histogram for each sentence; these features are the input to the logistic regression model. The BOW with logistic regression yielded the best baseline result with 51% test accuracy, which was still 17 percentage points lower than the proposed CNN model using Word2Vec. However, none of this work uses recurrent models to preserve the sequential nature of text.
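For concreteness, the following is a minimal sketch of one plausible reading of the soft-assignment BOW baseline described in [Hughes et al., 2017]: word vectors are clustered with k-means, each word contributes a distance-weighted soft count over the cluster centroids, and the per-sentence histograms feed a logistic regression classifier. The toy word vectors, the exponential distance weighting and k=5 (rather than the k=1000 used in that work) are illustrative placeholders, not details taken from the original paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder word vectors; in practice these would come from Word2Vec.
vocab = ["tick", "removed", "flea", "prevention", "vaccine", "booster"]
word_vecs = {w: rng.normal(size=50) for w in vocab}

# Cluster the word vectors to form a "feature vocabulary" of k centroids.
k = 5
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(
    np.stack(list(word_vecs.values()))
)

def soft_bow(sentence):
    """Soft-assign each word to the k clusters and sum the weights per sentence."""
    hist = np.zeros(k)
    for w in sentence.split():
        if w not in word_vecs:
            continue  # out-of-vocabulary words are skipped in this sketch
        d = np.linalg.norm(kmeans.cluster_centers_ - word_vecs[w], axis=1)
        weights = np.exp(-d) / np.exp(-d).sum()  # closer centroids get more weight
        hist += weights
    return hist

# Toy sentences standing in for annotated narratives (1 = tick parasitism noted).
sentences = ["tick removed", "flea prevention", "vaccine booster", "tick prevention"]
labels = [1, 0, 0, 1]
X = np.stack([soft_bow(s) for s in sentences])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(soft_bow("tick removed today").reshape(1, -1)))
```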
4 Methodology

4.1 Models

Model Configurations

For all the models below we use the Binary Cross Entropy (BCE) loss, with a learning rate η = 0.001 and the Adam optimizer. Dropout is used for regularization on all layers (not including the input) with a dropout rate p_d = 0.2. The pretrained embeddings are kept fixed throughout training (i.e. no gradient updates). The batch size is |x_s| = 200 for each model. Given the class imbalance, we choose to weight the losses inversely proportional to the frequency of each class during each mini-batch update. This avoids alternative approaches such as sampling methods [Chawla et al., 2002] at little cost.

Convolutional Neural Network

We test CNNs for text classification, which have also been used in the medical domain (as mentioned above [Hughes et al., 2017]) and were first proposed for sentence classification by [Kim, 2014]. The CNN model uses 100 2d filters each for kernels of size (2, 300) and (3, 300) over character n-gram embeddings (GloVe), subword embeddings (FastText)^1 and word embeddings (Word2Vec)^2, all of which have dimensionality d_w = 300.

^1 pretrained FastText: https://fasttext.cc/docs/en/crawl-vectors.html
^2 pretrained skipgram: https://code.google.com/archive/p/word2vec/

Our motivation for using FastText is that subword embeddings are learned first and then composed into word embeddings, which mitigates the problem of misspellings; they also deal with out-of-vocabulary terms, since a new word is likely to share some subwords with words already in the vocabulary. ReLU activations are used with 1d max pooling after each layer, followed by a concatenation of the last-layer features.
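To make the above configuration concrete, the following is a minimal sketch (in PyTorch) of a Kim-style CNN classifier under the stated hyperparameters: 100 filters per kernel size, kernel widths 2 and 3 over 300-dimensional embeddings, ReLU with 1d max pooling, dropout of 0.2, Adam with η = 0.001, and a BCE loss weighted towards the minority class. The module name, the use of BCEWithLogitsLoss with pos_weight to realise the inverse-frequency class weighting, and the precomputed fixed embeddings passed as input are our own illustrative assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Minimal Kim-style CNN over a sequence of precomputed, fixed embeddings."""
    def __init__(self, emb_dim=300, n_filters=100, kernel_sizes=(2, 3), p_drop=0.2):
        super().__init__()
        # One 2d convolution per kernel size; each filter spans the full embedding dim.
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, n_filters, (k, emb_dim)) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(p_drop)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), 1)  # single logit: TP vs. no TP

    def forward(self, x):
        # x: (batch, seq_len, emb_dim) -- embeddings are precomputed and kept fixed.
        x = x.unsqueeze(1)                                    # (batch, 1, seq_len, emb_dim)
        feats = [torch.relu(conv(x)).squeeze(3) for conv in self.convs]
        pooled = [torch.max(f, dim=2).values for f in feats]  # 1d max pooling over time
        out = self.dropout(torch.cat(pooled, dim=1))          # concatenate pooled features
        return self.fc(out).squeeze(1)                        # raw logits

model = TextCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(batch_emb, batch_labels):
    # Weight the positive class inversely to its frequency in the mini-batch:
    # one way to realise the class weighting described above (an assumption).
    pos_frac = batch_labels.float().mean().clamp(min=1e-6)
    loss_fn = nn.BCEWithLogitsLoss(pos_weight=(1 - pos_frac) / pos_frac)
    optimizer.zero_grad()
    loss = loss_fn(model(batch_emb), batch_labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, train_step would be called once per mini-batch of 200 sentences whose embeddings have already been looked up from the fixed pretrained tables.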
Gated Recurrent Network

As a second baseline approach, we test recurrent architectures with gated memory to preserve non-local dependencies between terms, which we would expect to further improve performance. The Gated Recurrent Unit (GRU) model uses 2 hidden layers, where the last output state, of size (1, 300), is passed to a dense layer. The weights are initialized using Xavier initialisation [Glorot and Bengio, 2010] (µ = 0, σ = 0.01) and tanh activation units are used.

[Figure 1: GRU-based Ensemble Architecture. Red corresponds to the last hidden state vector that outputs an embedding for both the in-domain and the large pretrained embedding inputs.]

Ensembled Feature Approach

In the two aforementioned models, the challenge of poorly typed notes is addressed using n-gram character vectors, subword vectors and word vectors. In the ensemble approach shown in Figure 1, we combine the latter two by concatenating both final hidden-layer encodings (red) and passing the result to a dense layer (green) before making the final prediction ŷ. This allows for interaction terms between the sentence encodings created by subword and word vectors. For regularization, we also use dropout in this dense layer with a rate p_d = 0.5, while other layers are kept at p_d = 0.2 as previously mentioned.

We also evaluate this approach when combining in-domain word embeddings trained on the clinical narratives with pretrained embeddings. This allows us to systematically combine the benefits of both vector representations by simply adding a dense layer that acts as a weighted combination of both sentence embeddings to produce the final encoding. In a similar fashion, we carry out this ensemble method for the previously mentioned 2-hidden-layer Convolutional Neural Network.

Below we summarize the steps in Equation 1, where E is an embedding matrix, Ẽ is a fine-tunable E, and Ē_S, Ē_W are the subword and word pretrained embeddings respectively, which are not updated during training. The input tokens w ∈ R^n are passed to the embedding matrix E ∈ R^{n×d}, which is then transformed with parameters W ∈ R^{d×m} and b, h, Θ ∈ R^{m×1}.

Equation 1 shows how p ∈ [0, 1] controls the tradeoff between the tunable task-specific embeddings and the static pretrained subword and word embeddings, acting as a weighted average between both input representations. Here ⊕ signifies concatenation. This is followed by a linear layer with a tanh activation unit, which results in z, which we use as input to our model. Note that in this configuration we perform the ensembling at the input with very few additional parameters.

$E = p\,\tilde{E}(w) \oplus (1-p)\left(\bar{E}_S(w) \oplus \bar{E}_W(w)\right), \qquad z = \tanh\left(\langle E, W \rangle + b\right)$  (1)

During training, at a timestep t ∈ T we pass word w_t to obtain E^w_t and subsequently z^w_t, which is then passed to the GRU as shown in Equation 2. Here h_{t−1} is the GRU hidden state from the previous timestep and h^L_T denotes the output of the last hidden layer L for the hidden state at time T.

$h_t = \mathrm{GRU}(z^w_t, h_{t-1}), \qquad \hat{y} = \phi\left(\langle h^L_T, \Theta \rangle\right)$  (2)

In contrast, we also consider passing each embedding separately and instead performing the ensembling at the output, as shown in Equation 3 (and in Figure 1), in which case Θ ∈ R^{3m}. In our experiments, we found the latter of these two approaches to outperform the former.

$\hat{y} = \phi\left(\langle \tilde{h}_t \oplus \bar{h}^S_t \oplus \bar{h}^W_t, \Theta \rangle\right)$  (3)

The Binary Cross Entropy (BCE) loss is then used as the objective, as shown in Equation 4, where N is the number of samples in a given mini-batch update.

$\ell(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$  (4)
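The following sketch illustrates how the two ensembling strategies could be wired up in PyTorch: the input-level mixture of Equation 1 and the output-level concatenation of Equation 3. The module name, the sigmoid parameterisation of p, the use of a separate 2-layer GRU per embedding stream and a hidden size of 300 are assumptions made for illustration; this is not the authors' released implementation.

```python
import torch
import torch.nn as nn

class EnsembleGRU(nn.Module):
    """Sketch of input-level (Eq. 1) and output-level (Eq. 3) embedding ensembling."""
    def __init__(self, emb_dim=300, hidden=300, ensemble_at="output"):
        super().__init__()
        self.ensemble_at = ensemble_at
        # Unconstrained scalar; a sigmoid keeps the mixing weight p in [0, 1].
        self.p_raw = nn.Parameter(torch.zeros(1))
        self.proj = nn.Linear(3 * emb_dim, emb_dim)  # W, b in Eq. 1
        # One 2-layer GRU per representation; only gru_dom is used in "input" mode.
        self.gru_dom = nn.GRU(emb_dim, hidden, num_layers=2, batch_first=True, dropout=0.2)
        self.gru_sub = nn.GRU(emb_dim, hidden, num_layers=2, batch_first=True, dropout=0.2)
        self.gru_word = nn.GRU(emb_dim, hidden, num_layers=2, batch_first=True, dropout=0.2)
        out_dim = hidden if ensemble_at == "input" else 3 * hidden
        self.dropout = nn.Dropout(0.5)        # dense-layer dropout, as in the text
        self.classifier = nn.Linear(out_dim, 1)  # Theta

    def forward(self, e_dom, e_sub, e_word):
        # Each input: (batch, seq_len, emb_dim) -- in-domain, subword and word embeddings.
        if self.ensemble_at == "input":
            p = torch.sigmoid(self.p_raw)
            e = torch.cat([p * e_dom, (1 - p) * e_sub, (1 - p) * e_word], dim=-1)
            z = torch.tanh(self.proj(e))          # Eq. 1
            _, h = self.gru_dom(z)
            feats = h[-1]                         # last layer's final hidden state (Eq. 2)
        else:
            feats = torch.cat([self.gru_dom(e_dom)[1][-1],
                               self.gru_sub(e_sub)[1][-1],
                               self.gru_word(e_word)[1][-1]], dim=-1)  # Eq. 3
        return self.classifier(self.dropout(feats)).squeeze(1)
```

The paper reports that the output-level variant (Equation 3) performed better, which corresponds to ensemble_at="output" in this sketch.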
5 Experimental Data

Collected Dataset. We demonstrate our method on the task of identifying TP in clinical records from animals, which to our knowledge is a novel application of text-based machine learning for this problem. The Small Animal Veterinary Surveillance Network (SAVSNET) dataset contains approximately 3.5 million records. The health records are submitted to SAVSNET at the end of consultations by a veterinary surgeon or nurse and list why the animal was brought into the veterinary practice^3.

A dataset of narratives containing the word tick (identified using the case-insensitive regex '\\W tick \\W') was identified. This comprised 27,075 narratives, which had been read and annotated for whether the veterinary surgeon had noted TP (the presence of a tick on the patient in the consulting room); 6,529 records were annotated positive for TP. A further set of 1.2 million randomly selected records with no mention of tick (which were therefore considered to be negative for TP but were not manually annotated) was also added to the dataset. We use an 80-20 split for training and testing and perform 5-fold cross validation on the training data.

^3 see here for more information: https://www.liverpool.ac.uk/savsnet/

6 Results

Exploratory Analysis. Figure 2 shows the log-frequency distribution of lengths for all clinical narratives. Each narrative can contain anywhere from one relatively long sentence to an entire paragraph. Therefore, we split narratives into separate sentence instances for training during classification, because encoding long paragraphs makes it too difficult for the RNN to preserve all the information in a single encoding. Hence, when using an RNN classifier, we average over the encodings of each sentence within a single narrative before passing the result to the last fully connected layer.

[Figure 2: Sentence & Narrative Length Distribution — log frequency against number of terms for narratives and sentences.]

Non-ANN Classification Results. Table 1 shows the results of non-neural-network models, including ensemble methods (Random Forest and Gradient Boosting), large-margin methods (Support Vector Machines) and kernel-based methods (Gaussian Processes). All models use a combination of tf-idf scores and unigram frequencies. We find that, in general, most of these models perform similarly, with the Support Vector Machine with a Radial Basis Function kernel slightly outperforming the alternatives. These methods are fast and require little memory, as the features are essentially counts (unigrams) and normalizations thereof (tf-idf).

                     Train (10-Fold CV)       Test
Models               Acc.    AUC    F1        Acc.    AUC    F1
SVM (RBF)            87.14   0.89   0.87      84.03   0.82   0.84
SVM (Linear)         85.91   0.84   0.79      81.69   0.82   0.86
Random Forest        81.62   0.83   0.80      80.49   0.83   0.80
Gradient Boosting    85.40   0.83   0.85      84.34   0.85   0.85
Gaussian Process     86.92   0.81   0.83      82.28   0.81   0.82

Table 1: TP (Non-Neural Network) Classification Results
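As an illustration of the non-neural baselines in Table 1, the following sketch combines tf-idf scores with raw unigram counts and fits an RBF-kernel SVM using scikit-learn. The toy narratives, the vectorizer defaults and the class_weight="balanced" setting are assumptions for illustration; the paper does not specify these details.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import SVC

# Toy stand-ins for annotated narratives (1 = tick parasitism noted, 0 = not).
texts = ["tick removed from left ear, advised monthly prevention",
         "owner requests tick and flea prevention only",
         "engorged tickk found on dorsum, removed with hook",
         "routine booster vaccination, no parasites seen"]
labels = [1, 0, 1, 0]

# Combine tf-idf scores with raw unigram frequencies, as described for Table 1.
features = FeatureUnion([
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("unigram_counts", CountVectorizer(lowercase=True)),
])

model = make_pipeline(features, SVC(kernel="rbf", class_weight="balanced"))
model.fit(texts, labels)
print(model.predict(["small tick found near eye"]))
```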
Neural Network Classification Results. Table 2 shows the classification results when using pretrained embeddings. Since the classes are imbalanced, 72% accuracy is achieved if the model only predicts the absence of TP. For this reason, it should be pointed out that although the performance seems relatively accurate, it is particularly challenging to mitigate false negatives.

The first section of Table 2 reports the results of CNN models with pretrained GloVe n-gram character vectors, FastText subword vectors and skipgram word embeddings trained on Google News. The second section uses the same inputs but with GRU networks. In the third section, "T" denotes vectors trained on the clinical narratives. Lastly, the ensemble models use a combination of both pretrained FastText vectors and FastText vectors trained on the clinical narratives, as discussed in the previous section. We find the best results are obtained using the GRU ensemble, based on the overall test performance.

                     Train                    Test
Models               Acc.    AUC    F1        Acc.    AUC    F1
Char-CNN             78.13   0.78   77.29     70.27   0.69   68.90
SubWord-CNN          85.84   0.89   84.38     82.61   0.81   81.15
Word-CNN             84.49   0.86   83.78     80.44   0.79   80.13
Char-GRU             79.13   0.79   77.37     74.02   0.73   74.92
SubWord-GRU          87.24   0.90   87.29     83.47   0.84   82.97
Word-RNN             84.20   0.83   84.88     79.68   0.77   79.02
T-Char-GRU           81.60   0.80   80.18     74.47   0.74   73.89
T-SubWord-GRU        84.78   0.83   82.69     76.11   0.75   75.45
T-Word-RNN           86.98   0.89   86.01     76.46   0.75   76.28
Ensemble-CNN         86.11   0.89   85.34     83.07   0.83   82.73
Ensemble-GRU         88.63   0.91   88.51     84.29   0.82   85.20

Table 2: TP Neural Network Classification Results

7 Conclusion & Future Work

We proposed an ensemble-based neural network to overcome the difficulties of inference when dealing with noisy medical data in the form of veterinary clinical notes. Similar baselines also show good performance, particularly when used with subword vectors. Recurrent models in general show improvements over convolutional neural networks. These models can be used to reduce manual labor for medical practitioners by assisting in the decision-making process even when misspellings are common.

The challenge of class balancing without a degradation in overall performance is a problem we defer to future work. Specifically, we plan to investigate other strategies to address imbalanced classes in the presence of noisy medical texts using data augmentation. One such approach involves the generative modeling of sentence embeddings to upsample the minority class, with the goal of reducing false positives and, more importantly, false negatives.

8 Acknowledgements

SAVSNET is based at the University of Liverpool. It is currently funded by the Biotechnology and Biological Sciences Research Council. The SAVSNET team is grateful to the veterinary practices and diagnostic laboratories that provide health data and without whose support this research would not be possible.

References

[Abdullah et al., 2016] Swaid Abdullah, Chris Helps, Severine Tasker, Hannah Newbury, and Richard Wall. Ticks infesting domestic dogs in the UK: a large-scale surveillance programme. Parasites & Vectors, 9(1):391, 2016.

[Chawla et al., 2002] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.

[Cohen et al., 2014] Raphael Cohen, Iddo Aviram, Michael Elhadad, and Noémie Elhadad. Redundancy-aware topic modeling for patient record notes. PLoS ONE, 9(2):e87555, 2014.
[Giraldo-Ríos and Betancur, 2018] Cristian Giraldo-Ríos and Oscar Betancur. Economic and health impact of the ticks in production animals. In Ticks and Tick-Borne Pathogens. IntechOpen, 2018.

[Glorot and Bengio, 2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[Hughes et al., 2017] Mark Hughes, I. Li, Spyros Kotoulas, and Toyotaro Suzumura. Medical text classification using convolutional neural networks. Studies in Health Technology and Informatics, 235:246–250, 2017.

[Iyer et al., 2013] Srinivasan V. Iyer, Rave Harpaz, Paea LePendu, Anna Bauer-Mehren, and Nigam H. Shah. Mining clinical text for signals of adverse drug-drug interactions. Journal of the American Medical Informatics Association, 21(2):353–362, 2013.

[Jameson and Medlock, 2011] Lisa J. Jameson and Jolyon M. Medlock. Tick surveillance in Great Britain. Vector-Borne and Zoonotic Diseases, 11(4):403–412, 2011.

[Kim, 2014] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

[Lohr et al., 2015] B. Lohr, I. Müller, M. Mai, D. E. Norris, O. Schöffski, and K.-P. Hunfeld. Epidemiology and cost of hospital care for Lyme borreliosis in Germany: lessons from a health care utilization database analysis. Ticks and Tick-borne Diseases, 6(1):56–62, 2015.

[Roberts et al., 2018] Kirk Roberts, Yuqi Si, Anshul Gandhi, and Elmer Bernstam. A FrameNet for cancer information in clinical narratives: Schema and annotation. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association, 2018.

[Swart et al., 2014] Arno Swart, Adolfo Ibañez-Justicia, Jan Buijs, Sip E. van Wieren, Tim R. Hofmeester, Hein Sprong, and Katsuhisa Takumi. Predicting tick presence by environmental risk mapping. Frontiers in Public Health, 2:238, 2014.

[Tulloch et al., 2017] J. S. P. Tulloch, L. McGinley, F. Sánchez-Vizcaíno, J. M. Medlock, and A. D. Radford. The passive surveillance of ticks using companion animal electronic health records. Epidemiology & Infection, 145(10):2020–2029, 2017.

[Yi and Beheshti, 2009] Kwan Yi and Jamshid Beheshti. A hidden Markov model-based text classification of medical documents. Journal of Information Science, 35(1):67–81, 2009.