Using an Ensemble of Generalised Linear and Deep Learning Models in the SMM4H 2017 Medical Concept Normalisation Task

Maksim Belousov (1), William Dixon (2,3), Goran Nenadic (1,2)

1 School of Computer Science, The University of Manchester, UK
2 Health eResearch Centre, Farr Institute, Manchester Academic Health Science Centre, The University of Manchester, UK
3 Arthritis Research UK Centre for Epidemiology, The University of Manchester, UK

Abstract

This paper describes a medical concept normalisation system developed for the 2nd Social Media Mining for Health Applications Shared Task 3. The proposed system contains three main stages: lexical normalisation, word vectorisation and classification. The lexical normalisation stage aims to correct spelling mistakes and to maximise the coverage of the pre-trained word embeddings used to generate word vectors in the following stage. We experimented with three different classification models. The multinomial logistic regression model achieved higher accuracy than the bidirectional recurrent neural network with gated recurrent units. However, an ensemble of both classification models based on the mean rule achieved the highest accuracy of 0.885 on the test dataset.

Introduction

Online health communities and social media platforms such as Twitter are often used by patients to discuss various health-related experiences, including personal health conditions and adverse drug reactions. However, patients rarely use official medical terms to express their symptoms; rather, they use descriptive expressions that explain how they feel (e.g. "kills my stomach" often refers to Abdominal pain, and "feel like everything that surrounds me are circling or rolling" refers to Vertigo). Attempts to mention diseases using medical terminology often result in misspelled variants (e.g. "tackacardia" instead of Tachycardia).

The medical concept normalisation task aims to map a layman's description of a medical condition to a corresponding concept in a standard medical dictionary such as MedDRA®*, the Medical Dictionary for Regulatory Activities (www.meddra.org). Formally, this task can be reduced to a multi-class classification problem with an extremely large number of classes (for example, MedDRA has over 20,000 Preferred Terms in total). Since terminologies are often organised hierarchically, the concept normalisation problem can sometimes be solved as a hierarchical classification problem. Still, some medical concepts are similar to other concepts (e.g. Hunger and Increased appetite), so it is often difficult to produce a unique mapping without the wider context in which a particular disease or symptom has been described.

Traditional approaches to concept normalisation are mostly based on string matching, such as rule-based term variation mapping [1] or learning edit distances between phrases [2, 3]. A recent study [4], however, has demonstrated that deep learning models, particularly convolutional (CNN) and recurrent neural networks (RNN) with pre-trained word embeddings obtained from large text corpora, significantly outperform previous state-of-the-art concept normalisation models on social media data. Similarly, a combination of pre-trained generic and target domain (i.e. task-specific) embeddings has been shown to improve the performance of sentence classification in the medical domain [5].

* MedDRA® is a registered trademark of the International Federation of Pharmaceutical Manufacturers and Associations (IFPMA).
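To make the classification framing concrete, the toy examples below restate the paper's own illustrations as (phrase, concept) training pairs; the concept names are stand-ins for the numeric MedDRA PT codes that the shared task actually requires.

```python
# Each training example pairs a colloquial phrase with a MedDRA Preferred
# Term; concept names stand in here for the actual numeric PT codes.
training_examples = [
    ("kills my stomach", "Abdominal pain"),
    ("feel like everything that surrounds me are circling or rolling", "Vertigo"),
    ("tackacardia", "Tachycardia"),  # misspelled mention of a medical term
]
# Normalisation then reduces to multi-class classification: given a phrase,
# predict one of the (potentially 20,000+) Preferred Terms.
```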
The goal of the 2nd Social Media Mining for Health Applications Shared Task 3 is to identify the MedDRA Preferred Term (PT) code for a given colloquial or other mention obtained from drug-related discussions on Twitter. The proposed ensemble system combines generalised linear and deep learning models that have been trained on both generic and target domain word embeddings.

System architecture

The system architecture consists of three stages: preprocessing, word vectorisation and classification. The preprocessing stage aims to address challenges related to noisy text and is focused on lexical normalisation (i.e. spelling correction, abbreviation expansion, slang conversion), stemming and stop word removal. During the word vectorisation stage, all words in the preprocessed sentences are converted into corresponding vector-space representations that are later utilised as features. Finally, the classification stage aims to predict a target concept and consists of an ensemble of multiple classifiers.

Lexical normalisation

As with any other social media posts, health-related discussions have the characteristics of informal communication, such as irregular grammar, misspellings, abbreviations and slang. To this end, the lexical normalisation component aims to reduce the noise and to maximise the effectiveness (i.e. coverage) of the pre-trained word embeddings used in the following stage. In particular, our lexical normalisation pipeline utilises three types of external resources (a minimal sketch of the pipeline follows the list):

• Vocabulary is a set of known words, used to identify unknown (or out-of-vocabulary) words, i.e. candidates for correction. It could be a list of all English words and medical terms; however, we narrowed it down to the vocabulary of a given pre-trained word embedding model.

• Mappings are translations from one word (or word form) to another, such as abbreviated to expanded forms (e.g. "hbp" to "high blood pressure") or interjections to synonymous words or phrases (e.g. "ouchy" to "hurt"). In particular, we used abbreviations and translations collected from the Internet & Text Slang Dictionary (noslang.com) and extended them with a manually curated list of popular medical abbreviations and slang observed in the training data.

• Language models are used to calculate a probability score for corrected phrase candidates and to pick the best candidate based on the combined ranking from all models. We utilised three different language models: a trigram model generated from Twitter drug discussions [6], a trigram model generated on 1 million sentences from popular support groups on the health-related social networking site DailyStrength (www.dailystrength.org), and a bigram model generated on medical expressions parsed from DrugInformer, a search engine for pharmaceutical products and their side effects (www.druginformer.com).
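The sketch below illustrates how the three resources interact, under stated assumptions: VOCABULARY, MAPPINGS and lm_score are toy stand-ins (the actual system uses the embedding vocabulary, noslang.com-derived mappings and the three n-gram language models), and difflib-based candidate generation stands in for a spelling-correction method the paper does not specify.

```python
import difflib

# Toy stand-ins for the three resources described above; the real system
# uses the embedding vocabulary, noslang.com mappings and n-gram models.
VOCABULARY = {"high", "blood", "pressure", "and", "hurt", "tachycardia"}
MAPPINGS = {"hbp": "high blood pressure", "ouchy": "hurt"}

def lm_score(phrase):
    # Hypothetical stand-in for the combined ranking of the three
    # language models; here it trivially prefers shorter phrases.
    return -len(phrase)

def normalise(phrase):
    tokens = []
    for token in phrase.lower().split():
        # 1) Expand abbreviations and slang via the mappings.
        for word in MAPPINGS.get(token, token).split():
            if word in VOCABULARY:
                tokens.append(word)
                continue
            # 2) Out-of-vocabulary word: generate correction candidates...
            candidates = difflib.get_close_matches(word, VOCABULARY, n=3)
            if candidates:
                # 3) ...and keep the candidate ranked best by the LM score.
                tokens.append(max(candidates,
                                  key=lambda c: lm_score(" ".join(tokens + [c]))))
            else:
                tokens.append(word)  # no plausible correction; keep as-is
    return " ".join(tokens)

print(normalise("hbp and tackacardia"))  # high blood pressure and tachycardia
```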
Word vectorisation

Vectorisation is the step in which all words are converted into numeric vector representations that can be used as features to train a machine learning classification model. We utilised word2vec [7] embeddings, which learn distributed representations of words by training a shallow neural network. In the proposed classifiers we used the following pre-trained models (two of them were trained on data from generic domains and one was trained on the target domain, namely Twitter drug discussions):

• GoogleNews: 300-dimensional vectors obtained from a model trained on Google News, with a vocabulary of 3 million words and phrases [8]
• Twitter: 400-dimensional vectors obtained from a model trained on 400 million tweets [9]
• DrugTwitter: 150-dimensional vectors learned from 1 million user sentences about drugs on Twitter [10]

The word vectors obtained from these pre-trained models were utilised differently depending on the classification algorithm.

Classification

To predict the most suitable medical concept for a textual description of a health condition, we used an ensemble model that combines multiple base classifiers using the mean (averaging) rule. Namely, the final prediction is made based on the highest average value for each class, derived from the predicted probabilities of the base learners:

    h_{\text{final}}(x) = \arg\max_{j} \frac{1}{T} \sum_{t=1}^{T} d_{t,j}(x)    (1)

where T is the number of base classifiers and d_{t,j}(x) is the probability of class j predicted by classifier t for input x. We applied this ensemble method in several places in the system: to utilise multiple word embeddings in the multinomial logistic regression model, and to combine the predictions of our base classifiers into the final system. We used three different prediction models for medical concept normalisation, corresponding to the three runs submitted for evaluation (a sketch of the Bi-GRU configuration and the mean rule follows the list):

• MultiLogReg: Multinomial logistic regression generalises logistic regression by allowing more than two discrete outcomes, which makes it suitable for multi-class problems. To represent an entire phrase as a vector, we calculated the mean of the corresponding weighted word vectors (using a zero vector for unknown words), where the word weights were inverse document frequencies, which indicate whether a word is common or rare across all phrases. The averaging rule was applied to combine target and generic domain embeddings from the three pre-trained models (GoogleNews, Twitter and DrugTwitter): the word vectors obtained from each word2vec model were used to train a separate logistic regression classifier, and their predictions were combined. Each logistic regression model was trained using limited-memory BFGS optimisation [11], limited to 100 iterations.

• Bi-GRU: Recurrent neural networks (RNNs) have an architecture designed to handle sequences of variable length and have therefore been used successfully in many natural language processing tasks. A bidirectional Gated Recurrent Unit (GRU) [12] network increases the amount of input information by performing forward and backward passes over the sequence, where the backward hidden states are calculated by feeding the input sequence in reverse order. For this model we utilised only the word vectors obtained from the GoogleNews model, since it was trained on the largest corpus and yielded the highest performance during preliminary evaluation on the development set. We set the number of units in the GRU layer to 70% of the embedding dimension. The model was trained using the AdaGrad [13] optimisation algorithm with a learning rate of 0.01. For regularisation, dropout with a rate of 0.1 was applied to the Bi-GRU output.

• Ensemble: The proposed ensemble model utilises the predictive power of both the MultiLogReg and Bi-GRU models by combining their predictions using the averaging rule shown in Equation 1.
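The paper does not include an implementation; below is a minimal sketch of the Bi-GRU classifier in Keras (TensorFlow), assuming inputs are zero-padded sequences of pre-computed GoogleNews word vectors. The values of emb_dim, max_len and n_classes come from figures stated in the paper; the layer layout and library choice are assumptions. The mean_rule function implements Equation 1.

```python
import numpy as np
from tensorflow.keras import layers, models, optimizers

emb_dim = 300                    # GoogleNews embedding dimension
gru_units = int(0.7 * emb_dim)   # 70% of the embedding dimension (210)
max_len = 22                     # longest phrase in the training data
n_classes = 472                  # concepts observed in the training data

# Bidirectional GRU over sequences of pre-computed word vectors.
bi_gru = models.Sequential([
    layers.Input(shape=(max_len, emb_dim)),
    layers.Bidirectional(layers.GRU(gru_units)),   # forward + backward pass
    layers.Dropout(0.1),                           # dropout on the Bi-GRU output
    layers.Dense(n_classes, activation="softmax"),
])
bi_gru.compile(optimizer=optimizers.Adagrad(learning_rate=0.01),
               loss="categorical_crossentropy", metrics=["accuracy"])

def mean_rule(prob_matrices):
    """Equation 1: average the class probabilities of T base classifiers
    (each of shape [n_samples, n_classes]), then take the argmax per sample."""
    return np.stack(prob_matrices).mean(axis=0).argmax(axis=1)
```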
Data

The training dataset for this task contains 6,650 phrases mapped to 472 concepts (14.09 phrases per concept on average; the most frequent concept, Insomnia, has 634 phrases, whereas 170 concepts have only a single mention). The average phrase length is 2 tokens (min: 1, max: 22). The test dataset contains 2,500 phrases.

Results and Discussion

Table 1 shows the evaluation accuracy of the three models on the test dataset. The multinomial logistic regression model (MultiLogReg), trained on both generic and target embeddings, outperformed the Bidirectional GRU (Bi-GRU) model trained only on the GoogleNews embeddings. However, the ensemble model yielded the highest accuracy score. This suggests that MultiLogReg and Bi-GRU learn slightly different information, which leads to different predictions, and that the ensemble model was able to pick the correct candidate in the majority of cases.

Table 1: Test accuracy of the proposed models

    Run            Test accuracy
    MultiLogReg    0.877
    Bi-GRU         0.855
    Ensemble       0.885

Table 2 compares the concepts predicted by the different models with the gold-standard labels for selected test phrases. For example, "weight nightmare" was classified as Nightmare by the multinomial logistic regression model; however, despite the mention of a nightmare and the lack of explicit information that the weight had increased, the two other models correctly associated it with Weight increased. When the concept Formication was described as "feeling like there's bugs under my skin", all systems incorrectly associated it with epidermal and dermal conditions; despite the mention of "skin", it is actually a neurological disorder. In other cases, where none of the systems made the correct prediction, at least one of them associated the phrase with a similar concept. For example, "taste buds aren't working" was predicted as Dysgeusia (a distortion of the sense of taste), which is related to the correct concept (Ageusia, a complete loss of taste).

Table 2: Examples of phrases and their corresponding actual and predicted concepts

    Correct concept     Phrase                                 MultiLogReg       Bi-GRU                Ensemble
    Feeling of despair  impending sense of doom                Somnolence        Feeling abnormal      Feeling abnormal
    Formication         feel like there's bugs under my skin   Pruritus          Photosens. reaction   Photosens. reaction
    Weight increased    weight nightmare                       Nightmare         Weight increased*     Weight increased*
    Abdom. discomfort   stomach feel weird                     Feel abnormal     Abdom. discomfort*    Abdom. discomfort*
    Ageusia             taste buds aren't working              Drug ineffect.    Dysgeusia             Dysgeusia
    Inj. site pain      humira injection redness               Inj. site pain*   Burning sens.         Inj. site infl.
    Fatigue             aren't I tired                         Insomnia          Fatigue*              Fatigue*
    Fatigue             me so painfully exhausted              Fatigue*          Insomnia              Insomnia

    * Correct predictions are marked with an asterisk.

Conclusions

We presented an ensemble system that combines generalised linear and deep learning models for medical concept normalisation in the context of the 2nd Social Media Mining for Health Applications Shared Task 3. Lexical normalisation was performed prior to classification in order to reduce noise and maximise the coverage of pre-trained word embeddings generated on generic and target domains. The multinomial logistic regression model achieved higher accuracy than the bidirectional recurrent neural network with gated recurrent units. However, the ensemble of both classifiers based on the mean rule yielded the highest performance.

References

1. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of the AMIA Symposium. American Medical Informatics Association; 2001. p. 17.
2. McCallum A, Bellare K, Pereira F. A conditional random field for discriminatively-trained finite-state string edit distance. arXiv preprint arXiv:1207.1406. 2012.
3. Ristad ES, Yianilos PN. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1998;20(5):522–532.
4. Limsopatham N, Collier N. Normalising medical concepts in social media texts by learning semantic representation. In: Proceedings of ACL (1); 2016.
5. Limsopatham N, Collier N. Modelling the combination of generic and target domain embeddings in a convolutional neural network for sentence classification. Association for Computational Linguistics; 2016.
6. Sarker A, Gonzalez G. A corpus for mining drug-related knowledge from Twitter chatter: language models and their utilities. Data in Brief. 2017;10:122–131.
7. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.
8. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems; 2013. p. 3111–3119.
9. Godin F, Vandersmissen B, De Neve W, Van de Walle R. Multimedia Lab @ ACL W-NUT NER shared task: named entity recognition for Twitter microposts using distributed word representations. ACL-IJCNLP. 2015;2015:146–153.
10. Nikfarjam A, Sarker A, O'Connor K, Ginn R, Gonzalez G. Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. Journal of the American Medical Informatics Association. 2015;22(3):671–681.
11. Byrd RH, Lu P, Nocedal J, Zhu C. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing. 1995;16(5):1190–1208.
12. Cho K, Van Merriënboer B, Bahdanau D, Bengio Y. On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259. 2014.
13. Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research. 2011;12(Jul):2121–2159.