Tick Parasitism Classification from Noisy Medical Records

James O'Neill^1*, Danushka Bollegala^1, Alan D. Radford^2 and PJ Noble^3
1 Department of Computer Science, University of Liverpool
2 Department of Infection Biology, University of Liverpool
3 Small Animal Department, University of Liverpool
{danushka.bollegala, alanrad, rtnortle}@liverpool.co.uk
* james.o-neill@liverpool.ac.uk

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Much of the health information in the medical domain comes in the form of clinical narratives. The rich semantic information contained in these notes can be modeled to make inferences that assist the decision-making process for medical practitioners, which is particularly important under time and resource constraints. However, the creation of such assistive tools is made difficult by the ubiquity of misspellings, unsegmented words and morphologically complex or rare medical terms. This reduces the coverage of vocabulary terms present in commonly used pretrained distributed word representations that are passed as input to the parametric models that make such predictions. This paper presents an ensemble architecture that combines in-domain and general word embeddings to overcome these challenges, showing the best performance on a binary classification task when compared to various other baselines. We demonstrate our approach in the veterinary domain on the task of identifying tick parasitism in small animals. The best model shows 84.29% test accuracy, an improvement over models that only use pretrained embeddings not specifically trained for the medical sub-domain of interest.

1 Introduction

Clinical narratives contain important and useful information about the health of a subject. Medical practitioners often have to spend a considerable amount of time reading these notes to make informed decisions, which can be quite laborious. Parametric models can be used to extract information that assists medical experts in decision-making while reducing this burden. However, spelling mistakes, complex medical terms and rare terms are ubiquitous in such clinical narratives [Roberts et al., 2018].

Tick parasitism (TP) is commonly seen in veterinary patients. Given that ticks can transmit a variety of diseases, including important zoonotic diseases (e.g. Lyme disease), tools to screen clinical records for reports of tick parasitism would be valuable for the surveillance of tick activity and subsequent disease, and for developing clinical decision support for clinicians.

The aim of this work is to automate the annotation of clinical notes from small animal practice for the presence of TP. We are motivated by the fact that using veterinary notes allows us to keep the privacy of the small animals intact while improving the quality of assistive diagnosis and, in turn, medication. This is something that is not easily achieved outside of veterinary practice.

We are able to take advantage of small animal clinical records collected through the Small Animal Veterinary Surveillance Network (SAVSNET). Narratives in the SAVSNET corpus are currently screened using simple regular expressions to identify mentions of the word 'tick'. These mentions may refer to a tick present on a pet or simply to discussion of tick prevention, and need to be manually annotated accordingly.

We propose a dynamic ensemble neural network that learns to classify TP from imprecise clinical narratives. Our approach incorporates fine-tuned in-domain word embeddings and domain-agnostic pretrained embeddings to improve classification performance. A learned parameter weights the combination of the embeddings in the overall classification.

Contributions. Our contributions are as follows:

1. A novel application of text classification to noisy clinical narratives, specifically in the veterinary domain.
2. To the best of our knowledge, the first known attempt to predict TP in animals from textual descriptions.
3. An ensemble approach that combines domain-agnostic and domain-specific representations (n-gram character, subword and word vectors) in a recurrent neural network architecture, compared against strong baselines of both non-neural network and neural network classifiers.
2 Background

2.1 Tick-borne Disease

Tick-borne diseases (TBD) are caused by a variety of pathogens (bacteria, viruses, rickettsia and protozoa) which are transmitted through tick bites. Identifying TBD and preventing them from spreading is difficult given that ticks have a wide geographic range and are highly adaptive to changing environments, allowing this range to increase. Globally, significant TBD include tularaemia and Rocky Mountain spotted fever. In a recent study using electronic health records from cats and dogs, TP was seen most commonly in the south-central region of England, with peak activity in summer and a smaller peak in cats in autumn [Tulloch et al., 2017].

2.2 Economic Impact of Tick Parasitism

A recent study from Germany highlighted a potential cost of more than 30M Euros resulting from Lyme borreliosis alone [Lohr et al., 2015]. In addition, recent work reviewed the impact of TP on production animals that are required to meet standard health conditions [Giraldo-Ríos and Betancur, 2018]: it reduces the survival rate of the animal and the production of meat, milk, eggs etc., not to mention the costs incurred for treatment.

2.3 Small Animal Ticks in the UK

In this work, we focus on small animals within the United Kingdom (UK). In the past decade, there has been an ongoing effort to collect ticks by the public, veterinary health agencies and practitioners within the UK as part of the Tick Surveillance Scheme and The Big Tick Project [Jameson and Medlock, 2011; Abdullah et al., 2016], in an effort to identify various tick species (predominantly from companion animals) and their locality across various regions of the UK. Although these projects demonstrate the viability of nationwide surveillance programs to monitor tick species, we argue that the requirement for active manual participation by contributors and the lack of automation, along with inconsistent or aperiodic data collection, present barriers to participation and to the collation of representative data in the long term. Using an automated system to screen clinical notes for tick parasitism will enhance tick surveillance. Furthermore, this work will provide a model for systems that might be used for clinical note summarisation and, subsequently, for clinical decision-making support.

3 Related Research

Most previous work on medical text classification has focused on cleaner text extracted from more formal registers. However, there has been recent work exploring classification on health records and notes, which we include below.

A key challenge is making use of information-rich notes while reducing the redundancy contained in the corpus due to the copying of notes, which can lead to a degradation in performance. In the context of topic modelling, prior work has applied a variant of Latent Dirichlet Allocation (LDA) to patient record notes [Cohen et al., 2014], which they refer to as Red-LDA. Red-LDA removes this redundancy and improves on topic coherence and qualitative assessments in comparison to standard LDA.

Yi and Beheshti used a hidden Markov model for classifying medical documents that incorporates prior knowledge in the form of medical subject headings [Yi and Beheshti, 2009]. Iyer et al. performed text mining on clinical text for drug-event recognition from 50 million clinical notes to create a timeline of adverse drug event mentions per patient [Iyer et al., 2013]. This was used to identify drug-drug-event associations for 1,165 drugs and 14 events.

Other methods that do not use text have relied on geospatial data for modelling tick presence [Swart et al., 2014]. The model predicted the presence of ticks within a 1 km² grid from field data using satellite-based methodology with Bayesian priors chosen over landscape types. Tick presence was estimated for 54% of the land cover, with ticks found at 37% of the 677 coordinates sampled.

Lastly, recent work has performed medical text classification with convolutional neural networks (CNN) using Word2Vec as input [Hughes et al., 2017]. This was shown to outperform logistic regression using Doc2Vec representations or bag-of-words (BOW) based Word2Vec approaches. They use k-means clustering (k=1000) to generate a feature vocabulary, which is then used to build a soft-assignment BOW histogram for each sentence; these features are the input to the logistic regression model. The BOW with logistic regression yielded the best baseline result with 51% test accuracy, which was still 17 percentage points lower than the proposed CNN model using Word2Vec. However, none of this work uses recurrent models to preserve the sequential nature of text.
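For concreteness, the following is a minimal sketch of one plausible reading of the soft-assignment BOW baseline described in [Hughes et al., 2017]: word vectors are clustered with k-means, each word contributes a distance-weighted soft count over the cluster centroids, and the per-sentence histograms feed a logistic regression classifier. The toy word vectors, the exponential distance weighting and k=5 (rather than the k=1000 used in that work) are illustrative placeholders, not details taken from the original paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder word vectors; in practice these would come from Word2Vec.
vocab = ["tick", "removed", "flea", "prevention", "vaccine", "booster"]
word_vecs = {w: rng.normal(size=50) for w in vocab}

# Cluster the word vectors to form a "feature vocabulary" of k centroids.
k = 5
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(
    np.stack(list(word_vecs.values()))
)

def soft_bow(sentence):
    """Soft-assign each word to the k clusters and sum the weights per sentence."""
    hist = np.zeros(k)
    for w in sentence.split():
        if w not in word_vecs:
            continue  # out-of-vocabulary words are skipped in this sketch
        d = np.linalg.norm(kmeans.cluster_centers_ - word_vecs[w], axis=1)
        weights = np.exp(-d) / np.exp(-d).sum()  # closer centroids get more weight
        hist += weights
    return hist

# Toy sentences standing in for annotated narratives (1 = tick parasitism noted).
sentences = ["tick removed", "flea prevention", "vaccine booster", "tick prevention"]
labels = [1, 0, 0, 1]
X = np.stack([soft_bow(s) for s in sentences])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(soft_bow("tick removed today").reshape(1, -1)))
```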
4 Methodology

4.1 Models

Model Configurations

For all the models below we use the Binary Cross Entropy (BCE) loss, with a learning rate η = 0.001 and the Adam optimizer. Dropout is used for regularization on all layers (not including the input) with a dropout rate p_d = 0.2. The pretrained embeddings are kept fixed throughout training (i.e. no gradient updates). The batch size is |x_s| = 200 for each model. Given the class imbalance, we choose to weight the losses inversely proportional to the frequency of each class during each mini-batch update. This avoids alternative approaches such as sampling methods [Chawla et al., 2002] at little cost.

Convolutional Neural Network

We test CNNs for text classification, which have also been used in the medical domain (as mentioned above [Hughes et al., 2017]) and were first proposed for sentence classification by [Kim, 2014]. The CNN model uses 100 2d filters each for kernels of size (2, 300) and (3, 300) over character n-gram embeddings (GloVe), subword embeddings (FastText)^1 and word embeddings (Word2Vec)^2, all of which have dimensionality d_w = 300.

^1 pretrained FastText: https://fasttext.cc/docs/en/crawl-vectors.html
^2 pretrained skipgram: https://code.google.com/archive/p/word2vec/

Our motivation for using FastText is that subword embeddings are learned first and then composed into word embeddings, which mitigates the problem of misspellings; they also deal with out-of-vocabulary terms, since a new word is likely to share some subwords with words already in the vocabulary. ReLU activations are used with 1d max pooling after each layer, followed by a concatenation of the last-layer features.
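To make the above configuration concrete, the following is a minimal sketch (in PyTorch) of a Kim-style CNN classifier under the stated hyperparameters: 100 filters per kernel size, kernel widths 2 and 3 over 300-dimensional embeddings, ReLU with 1d max pooling, dropout of 0.2, Adam with η = 0.001, and a BCE loss weighted towards the minority class. The module name, the use of BCEWithLogitsLoss with pos_weight to realise the inverse-frequency class weighting, and the precomputed fixed embeddings passed as input are our own illustrative assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Minimal Kim-style CNN over a sequence of precomputed, fixed embeddings."""
    def __init__(self, emb_dim=300, n_filters=100, kernel_sizes=(2, 3), p_drop=0.2):
        super().__init__()
        # One 2d convolution per kernel size; each filter spans the full embedding dim.
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, n_filters, (k, emb_dim)) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(p_drop)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), 1)  # single logit: TP vs. no TP

    def forward(self, x):
        # x: (batch, seq_len, emb_dim) -- embeddings are precomputed and kept fixed.
        x = x.unsqueeze(1)                                    # (batch, 1, seq_len, emb_dim)
        feats = [torch.relu(conv(x)).squeeze(3) for conv in self.convs]
        pooled = [torch.max(f, dim=2).values for f in feats]  # 1d max pooling over time
        out = self.dropout(torch.cat(pooled, dim=1))          # concatenate pooled features
        return self.fc(out).squeeze(1)                        # raw logits

model = TextCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(batch_emb, batch_labels):
    # Weight the positive class inversely to its frequency in the mini-batch:
    # one way to realise the class weighting described above (an assumption).
    pos_frac = batch_labels.float().mean().clamp(min=1e-6)
    loss_fn = nn.BCEWithLogitsLoss(pos_weight=(1 - pos_frac) / pos_frac)
    optimizer.zero_grad()
    loss = loss_fn(model(batch_emb), batch_labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, train_step would be called once per mini-batch of 200 sentences whose embeddings have already been looked up from the fixed pretrained tables.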
Gated Recurrent Network

As a second baseline approach, we test recurrent architectures with gated memory to preserve non-local dependencies between terms, which we would expect to further improve performance. The Gated Recurrent Unit (GRU) model uses 2 hidden layers, where the last output state, of size (1, 300), is passed to a dense layer. The weights are initialized using Xavier initialisation [Glorot and Bengio, 2010] (µ = 0, σ = 0.01) and tanh activation units are used.

[Figure 1: GRU-based Ensemble Architecture. Red corresponds to the last hidden state vector that outputs an embedding for both the in-domain and the large pretrained embedding inputs.]

Ensembled Feature Approach

In the two aforementioned models, the challenge of poorly typed notes is addressed using n-gram character vectors, subword vectors and word vectors. In the ensemble approach shown in Figure 1, we combine the latter two by concatenating both final hidden-layer encodings (red) and passing the result to a dense layer (green) before making the final prediction ŷ. This allows for interaction terms between the sentence encodings created by subword and word vectors. For regularization, we also use dropout in this dense layer with a rate p_d = 0.5, while other layers are kept at p_d = 0.2 as previously mentioned.

We also evaluate this approach when combining in-domain word embeddings trained on the clinical narratives with pretrained embeddings. This allows us to systematically combine the benefits of both vector representations by simply adding a dense layer that acts as a weighted combination of both sentence embeddings to produce the final encoding. In a similar fashion, we carry out this ensemble method for the previously mentioned 2-hidden-layer Convolutional Neural Network.

Below we summarize the steps in Equation 1, where E is an embedding matrix, Ẽ is a fine-tunable E, and Ē_S, Ē_W are the subword and word pretrained embeddings respectively, which are not updated during training. The input tokens w ∈ R^n are passed to the embedding matrix E ∈ R^{n×d}, which is then transformed with parameters W ∈ R^{d×m} and b, h, Θ ∈ R^{m×1}.

Equation 1 shows how p ∈ [0, 1] controls the tradeoff between the tunable task-specific embeddings and the static pretrained subword and word embeddings, acting as a weighted average between both input representations. Here ⊕ signifies concatenation. This is followed by a linear layer with a tanh activation unit, which results in z, which we use as input to our model. Note that in this configuration we perform the ensembling at the input with very few additional parameters.

$E = p\,\tilde{E}(w) \oplus (1-p)\left(\bar{E}_S(w) \oplus \bar{E}_W(w)\right), \qquad z = \tanh\left(\langle E, W \rangle + b\right)$  (1)

During training, at a timestep t ∈ T we pass word w_t to obtain E^w_t and subsequently z^w_t, which is then passed to the GRU as shown in Equation 2. Here h_{t−1} is the GRU hidden state from the previous timestep and h^L_T denotes the output of the last hidden layer L for the hidden state at time T.

$h_t = \mathrm{GRU}(z^w_t, h_{t-1}), \qquad \hat{y} = \phi\left(\langle h^L_T, \Theta \rangle\right)$  (2)

In contrast, we also consider passing each embedding separately and instead performing the ensembling at the output, as shown in Equation 3 (and in Figure 1), in which case Θ ∈ R^{3m}. In our experiments, we found the latter of these two approaches to outperform the former.

$\hat{y} = \phi\left(\langle \tilde{h}_t \oplus \bar{h}^S_t \oplus \bar{h}^W_t, \Theta \rangle\right)$  (3)

The Binary Cross Entropy (BCE) loss is then used as the objective, as shown in Equation 4, where N is the number of samples in a given mini-batch update.

$\ell(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$  (4)
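The following sketch illustrates how the two ensembling strategies could be wired up in PyTorch: the input-level mixture of Equation 1 and the output-level concatenation of Equation 3. The module name, the sigmoid parameterisation of p, the use of a separate 2-layer GRU per embedding stream and a hidden size of 300 are assumptions made for illustration; this is not the authors' released implementation.

```python
import torch
import torch.nn as nn

class EnsembleGRU(nn.Module):
    """Sketch of input-level (Eq. 1) and output-level (Eq. 3) embedding ensembling."""
    def __init__(self, emb_dim=300, hidden=300, ensemble_at="output"):
        super().__init__()
        self.ensemble_at = ensemble_at
        # Unconstrained scalar; a sigmoid keeps the mixing weight p in [0, 1].
        self.p_raw = nn.Parameter(torch.zeros(1))
        self.proj = nn.Linear(3 * emb_dim, emb_dim)  # W, b in Eq. 1
        # One 2-layer GRU per representation; only gru_dom is used in "input" mode.
        self.gru_dom = nn.GRU(emb_dim, hidden, num_layers=2, batch_first=True, dropout=0.2)
        self.gru_sub = nn.GRU(emb_dim, hidden, num_layers=2, batch_first=True, dropout=0.2)
        self.gru_word = nn.GRU(emb_dim, hidden, num_layers=2, batch_first=True, dropout=0.2)
        out_dim = hidden if ensemble_at == "input" else 3 * hidden
        self.dropout = nn.Dropout(0.5)        # dense-layer dropout, as in the text
        self.classifier = nn.Linear(out_dim, 1)  # Theta

    def forward(self, e_dom, e_sub, e_word):
        # Each input: (batch, seq_len, emb_dim) -- in-domain, subword and word embeddings.
        if self.ensemble_at == "input":
            p = torch.sigmoid(self.p_raw)
            e = torch.cat([p * e_dom, (1 - p) * e_sub, (1 - p) * e_word], dim=-1)
            z = torch.tanh(self.proj(e))          # Eq. 1
            _, h = self.gru_dom(z)
            feats = h[-1]                         # last layer's final hidden state (Eq. 2)
        else:
            feats = torch.cat([self.gru_dom(e_dom)[1][-1],
                               self.gru_sub(e_sub)[1][-1],
                               self.gru_word(e_word)[1][-1]], dim=-1)  # Eq. 3
        return self.classifier(self.dropout(feats)).squeeze(1)
```

The paper reports that the output-level variant (Equation 3) performed better, which corresponds to ensemble_at="output" in this sketch.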
5 Experimental Data

Collected Dataset. We demonstrate our method on the task of identifying TP in clinical records from animals, which to our knowledge is a novel application of text-based machine learning for this problem. The Small Animal Veterinary Surveillance Network (SAVSNET) dataset contains approximately 3.5 million records. The health records are submitted to SAVSNET at the end of consultations by a veterinary surgeon or nurse and list why the animal was brought into the veterinary practice^3.

A dataset of narratives containing the word tick (identified using the case-insensitive regex '\\W tick \\W') was identified. This comprised 27,075 narratives, which had been read and annotated for whether the veterinary surgeon had noted TP (the presence of a tick on the patient in the consulting room); 6,529 records were annotated positive for TP. A further set of 1.2 million randomly selected records with no mention of tick (which were therefore considered to be negative for TP but were not manually annotated) was also added to the dataset. We use an 80-20 split for training and testing and perform 5-fold cross validation on the training data.

^3 see here for more information: https://www.liverpool.ac.uk/savsnet/

6 Results

Exploratory Analysis. Figure 2 shows the log-frequency distribution of lengths for all clinical narratives. Each narrative can contain anywhere from one relatively long sentence to an entire paragraph. Therefore, we split narratives into separate sentence instances for training during classification, because encoding long paragraphs makes it too difficult for the RNN to preserve all the information in a single encoding. Hence, when using an RNN classifier, we average over the encodings of each sentence within a single narrative before passing the result to the last fully connected layer.

[Figure 2: Sentence & Narrative Length Distribution — log frequency against number of terms for narratives and sentences.]

Non-ANN Classification Results. Table 1 shows the results of non-neural-network models, including ensemble methods (Random Forest and Gradient Boosting), large-margin methods (Support Vector Machines) and kernel-based methods (Gaussian Processes). All models use a combination of tf-idf scores and unigram frequencies. We find that, in general, most of these models perform similarly, with the Support Vector Machine with a Radial Basis Function kernel slightly outperforming the alternatives. These methods are fast and require little memory, as the features are essentially counts (unigrams) and normalizations thereof (tf-idf).

                     Train (10-Fold CV)       Test
Models               Acc.    AUC    F1        Acc.    AUC    F1
SVM (RBF)            87.14   0.89   0.87      84.03   0.82   0.84
SVM (Linear)         85.91   0.84   0.79      81.69   0.82   0.86
Random Forest        81.62   0.83   0.80      80.49   0.83   0.80
Gradient Boosting    85.40   0.83   0.85      84.34   0.85   0.85
Gaussian Process     86.92   0.81   0.83      82.28   0.81   0.82

Table 1: TP (Non-Neural Network) Classification Results
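As an illustration of the non-neural baselines in Table 1, the following sketch combines tf-idf scores with raw unigram counts and fits an RBF-kernel SVM using scikit-learn. The toy narratives, the vectorizer defaults and the class_weight="balanced" setting are assumptions for illustration; the paper does not specify these details.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import SVC

# Toy stand-ins for annotated narratives (1 = tick parasitism noted, 0 = not).
texts = ["tick removed from left ear, advised monthly prevention",
         "owner requests tick and flea prevention only",
         "engorged tickk found on dorsum, removed with hook",
         "routine booster vaccination, no parasites seen"]
labels = [1, 0, 1, 0]

# Combine tf-idf scores with raw unigram frequencies, as described for Table 1.
features = FeatureUnion([
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("unigram_counts", CountVectorizer(lowercase=True)),
])

model = make_pipeline(features, SVC(kernel="rbf", class_weight="balanced"))
model.fit(texts, labels)
print(model.predict(["small tick found near eye"]))
```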
Neural Network Classification Results. Table 2 shows the classification results when using pretrained embeddings. Since the classes are imbalanced, 72% accuracy is achieved if the model only predicts the absence of TP. For this reason, it should be pointed out that although the performance seems relatively accurate, it is particularly challenging to mitigate false negatives.

The first section of Table 2 reports the results of CNN models with pretrained GloVe n-gram character vectors, FastText subword vectors and skipgram word embeddings trained on Google News. The second section uses the same inputs but with GRU networks. In the third section, "T" denotes vectors trained on the clinical narratives. Lastly, the ensemble models use a combination of both pretrained FastText vectors and FastText vectors trained on the clinical narratives, as discussed in the previous section. We find the best results are obtained using the GRU ensemble, based on the overall test performance.

                     Train                    Test
Models               Acc.    AUC    F1        Acc.    AUC    F1
Char-CNN             78.13   0.78   77.29     70.27   0.69   68.90
SubWord-CNN          85.84   0.89   84.38     82.61   0.81   81.15
Word-CNN             84.49   0.86   83.78     80.44   0.79   80.13
Char-GRU             79.13   0.79   77.37     74.02   0.73   74.92
SubWord-GRU          87.24   0.90   87.29     83.47   0.84   82.97
Word-RNN             84.20   0.83   84.88     79.68   0.77   79.02
T-Char-GRU           81.60   0.80   80.18     74.47   0.74   73.89
T-SubWord-GRU        84.78   0.83   82.69     76.11   0.75   75.45
T-Word-RNN           86.98   0.89   86.01     76.46   0.75   76.28
Ensemble-CNN         86.11   0.89   85.34     83.07   0.83   82.73
Ensemble-GRU         88.63   0.91   88.51     84.29   0.82   85.20

Table 2: TP Neural Network Classification Results

7 Conclusion & Future Work

We proposed an ensemble-based neural network to overcome the difficulties of inference when dealing with noisy medical data in the form of veterinary clinical notes. Similar baselines also show good performance, particularly when used with subword vectors. Recurrent models in general show improvements over convolutional neural networks. These models can be used to reduce manual labor for medical practitioners by assisting in the decision-making process even when misspellings are common.

The challenge of class balancing without a degradation in overall performance is a problem we defer to future work. Specifically, we plan to investigate other strategies to address imbalanced classes in the presence of noisy medical texts using data augmentation. One such approach involves the generative modeling of sentence embeddings to upsample the minority class, with the goal of reducing false positives and, more importantly, false negatives.

8 Acknowledgements

SAVSNET is based at the University of Liverpool. It is currently funded by the Biotechnology and Biological Sciences Research Council. The SAVSNET team is grateful to the veterinary practices and diagnostic laboratories that provide health data and without whose support this research would not be possible.

References

[Abdullah et al., 2016] Swaid Abdullah, Chris Helps, Severine Tasker, Hannah Newbury, and Richard Wall. Ticks infesting domestic dogs in the UK: a large-scale surveillance programme. Parasites & Vectors, 9(1):391, 2016.

[Chawla et al., 2002] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.

[Cohen et al., 2014] Raphael Cohen, Iddo Aviram, Michael Elhadad, and Noémie Elhadad. Redundancy-aware topic modeling for patient record notes. PLoS ONE, 9(2):e87555, 2014.
[Giraldo-Ríos and Betancur, 2018] Cristian Giraldo-Ríos and Oscar Betancur. Economic and health impact of the ticks in production animals. In Ticks and Tick-Borne Pathogens. IntechOpen, 2018.

[Glorot and Bengio, 2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[Hughes et al., 2017] Mark Hughes, I. Li, Spyros Kotoulas, and Toyotaro Suzumura. Medical text classification using convolutional neural networks. Studies in Health Technology and Informatics, 235:246–250, 2017.

[Iyer et al., 2013] Srinivasan V. Iyer, Rave Harpaz, Paea LePendu, Anna Bauer-Mehren, and Nigam H. Shah. Mining clinical text for signals of adverse drug-drug interactions. Journal of the American Medical Informatics Association, 21(2):353–362, 2013.

[Jameson and Medlock, 2011] Lisa J. Jameson and Jolyon M. Medlock. Tick surveillance in Great Britain. Vector-Borne and Zoonotic Diseases, 11(4):403–412, 2011.

[Kim, 2014] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

[Lohr et al., 2015] B. Lohr, I. Müller, M. Mai, D. E. Norris, O. Schöffski, and K.-P. Hunfeld. Epidemiology and cost of hospital care for Lyme borreliosis in Germany: lessons from a health care utilization database analysis. Ticks and Tick-borne Diseases, 6(1):56–62, 2015.

[Roberts et al., 2018] Kirk Roberts, Yuqi Si, Anshul Gandhi, and Elmer Bernstam. A FrameNet for cancer information in clinical narratives: Schema and annotation. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association, 2018.

[Swart et al., 2014] Arno Swart, Adolfo Ibañez-Justicia, Jan Buijs, Sip E. van Wieren, Tim R. Hofmeester, Hein Sprong, and Katsuhisa Takumi. Predicting tick presence by environmental risk mapping. Frontiers in Public Health, 2:238, 2014.

[Tulloch et al., 2017] J. S. P. Tulloch, L. McGinley, F. Sánchez-Vizcaíno, J. M. Medlock, and A. D. Radford. The passive surveillance of ticks using companion animal electronic health records. Epidemiology & Infection, 145(10):2020–2029, 2017.

[Yi and Beheshti, 2009] Kwan Yi and Jamshid Beheshti. A hidden Markov model-based text classification of medical documents. Journal of Information Science, 35(1):67–81, 2009.