A Hybrid Approach based Sentiment Extraction from Medical Contexts

Anupam Mondal¹, Ranjan Satapathy², Dipankar Das¹, Sivaji Bandyopadhyay¹
¹ Computer Science and Engineering, Jadavpur University, India
anupam@sentic.net, ddas@cse.jdvu.ac.in, sbandyopadhyay@cse.jdvu.ac.in
² School of Computer and Information Sciences, University of Hyderabad, India
kumarsatpathy@gmail.com

Proceedings of the 4th Workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2016), IJCAI 2016, pages 35-40, New York City, USA, July 10, 2016.

Abstract

In the domain of Biomedical Natural Language Processing (Bio-NLP), information extraction and context sentiment identification are treated as emerging tasks. Several linguistic features such as negation, uni-gram, bi-gram and Part-of-Speech (POS) have been used to extract medical concepts and their sense-based context level information. In the present attempt, a hybrid approach that combines linguistic and machine learning approaches is introduced to extract contextual sense-based information from a medical corpus. The extraction of sentiment-oriented keywords is the crucial part of identifying the senses of medical contexts. In our previous work, we developed a medical sense-based lexicon known as WordNet of Medical Event (WME). Several sentiment lexicons such as SentiWordNet and SenticNet were used to represent WME. In contrast, one of our primary motivations here is to build a sentiment extraction model based on medical contexts that leverages the knowledge of WME using a hybrid approach. The developed model is based on two phases, namely a pre-processing phase and a learning phase. The pre-processing phase is responsible for extracting and preparing structured data from the raw contexts, whereas the learning phase helps to identify the sentiment patterns and evaluate the sentiment extraction process. The two-phase hybrid model provides 81% accuracy for classifying medical contexts as positive or negative by employing NaïveBayes and Sequential minimal optimization (SMO) supervised classifiers.

1 Introduction

One of the major objectives of Sentiment Analysis is to identify and extract the subjective information from a given text using rule-based or machine learning approaches [Cambria, 2016]. Domain-specific knowledge combined with these approaches helps us to extract the contextual sentiment information from a medical corpus. Due to the lack of involvement of domain experts and the unavailability of domain-specific structured corpora, the task is challenging in the Bio-NLP domain. To overcome the scarcity of such domain-specific knowledge for sentiment analysis, several lexicons have been developed, such as Medical Event Net (MEN), Medical Fact Net (MFN), Medical Belief Net (MBN) and WordNet of Medical Event (WME) [Cambria et al., 2010]. These lexicons help to extract the sense of a medical concept together with fact and belief oriented information.

The present paper reports the development of a medical context based sentiment extraction model. One of our primary aims is to identify the sense-based concepts from the medical contexts and extract their related sentiment features. In order to identify the sense-based medical concepts, we have introduced the current version of the WordNet of Medical Event (WME2.0) knowledge base. WME2.0 contains medical concepts with their related linguistic and sense-oriented features: POS, gloss of the concept, semantics, polarity score, affinity score, gravity score and sense(s). Among these features, we have considered only the sense-based ones (semantics, polarity score, affinity score and sense) to develop our present sentiment extraction model [Swaminathan et al., 2010]. On top of the medical concepts extracted with the WME2.0 lexicon, we have applied linguistic and machine learning approaches to obtain the final sentiment of the contexts. The linguistic approach helps to manage the negation of the contexts as well as to derive new rules for extracting the sense(s) of such contexts. The POS, uni-gram, bi-gram, affinity score, polarity score and sense features of the medical concepts of WME2.0 help to extract the sentiment of the medical contexts. The supervised machine learning approach has been introduced to verify the contextual sentiment extracted using the linguistic approach. In the process, we have applied NaïveBayes and Sequential minimal optimization (SMO) supervised machine learning classifiers on the derived linguistic features.

In this paper, we have incorporated linguistic and machine learning approaches together as a hybrid model to leverage the sentiment-oriented knowledge of both approaches [Villena-Román et al., 2011]. The proposed hybrid model follows a two-phase architecture, namely a pre-processing phase and a learning phase. In the pre-processing phase, we have focused on the preparation of structured medical concepts from the raw medical contexts, and the learning phase helps to extract the sentiment of such contexts and evaluate them. The two-phase model generates its output in the form of a positive or negative sentiment for the context. The hybrid approach based learning phase provides 81% accuracy in extracting the medical context based sentiment information.

The remainder of the paper is structured as follows. Section 2 presents related work, followed by the model design describing the pre-processing and learning phases in Section 3. Section 4 discusses the model and the evaluation process we have followed. Finally, in Section 5, we present our conclusion and the future scope of the model.

2 Related Work

Sentiment analysis of medical contexts is a contributory and growing research field in the Bio-NLP domain [Cambria et al., 2013]. A large number of unstructured corpora and the lack of domain experts' involvement make this task more challenging. Researchers have therefore focused on developing medical sentiment-based lexicons to identify the sentiments of medical concepts. The medical concepts and their sense-based features in turn help to identify the sentiment of the medical contexts. Linguistic, machine learning and hybrid approaches have been introduced to build concept and context based sentiment extraction systems.

The linguistic approach helps to find negation words and phrases and to construct knowledge-based rules (with unigram, bigram and n-gram features) for context level sentiment extraction [Elkin et al., 2005; Niu et al., 2005; Szarvas et al., 2008]. Smith and Fellbaum developed Medical WordNet (MEN) along with two sub-networks, namely Medical FactNet (MFN) and Medical BeliefNet (MBN), for the evaluation of consumer health reports [Smith and Fellbaum, 2004]. MEN was developed with the help of the formal architecture of the Princeton WordNet [Fellbaum, 1998]. MFN serves to assist the non-expert group by providing a better understanding of basic medical information. MBN identifies beliefs about medical phenomena. Their primary motivation was to develop a network of medical information retrieval systems with a visualization effect. Domain-specific knowledge and the above-mentioned features are essential to improve the efficiency of a sentiment extraction system [Shukla et al., 2015]. However, these approaches were not able to provide adequate accuracy due to the lack of knowledge involvement from domain experts.

To overcome this problem, researchers introduced supervised machine learning approaches [Smith and Lee, 2012]. Standard NaïveBayes, Multinomial NaïveBayes and Support Vector Machine (SVM) supervised classifiers were applied with unigram, bigram, Part-of-Speech (POS) and negation features under the machine learning framework.

Researchers have also used hybrid approaches to improve the accuracy of medical context based sentiment extraction systems. One of the hybrid approaches was developed with a combination of linguistic and machine learning approaches [Boytcheva et al., 2005; Villena-Román et al., 2011]. Sohn et al. developed an emotion identification system for suicide notes using a hybrid approach [Sohn et al., 2012]. The suicide notes were provided by the challenge organizers of Informatics for Integrating Biology and the Bedside (I2B2). Machine learning, linguistic rule-based and combined approaches were applied to the training dataset of the suicide notes, and the system achieved a micro-average F-score of 0.5640 on the training dataset. Birks et al. applied a combination of RIPPER (Repeated Incremental Pruning to Produce Error Reduction), a multinomial NaïveBayes classifier and manual pattern matching rules to identify the emotions of sentences [Birks et al., 2009]. Mondal et al. developed the WordNet of Medical Events (WME) lexicon to identify medical concepts and their knowledge-based and semantic features using a hybrid approach [Mondal et al., 2015]. The latest version of WME (WME2.0) contains the POS, semantics, gloss, affinity score, gravity score, polarity score and sense features of the concepts [Mondal et al., 2016]. The WME2.0 sentiment lexicon identifies the senses of the concepts using SentiWordNet¹, SenticNet², Bing Liu's lexicon³ and Taboada's adjective list [Mondal et al., 2016; Mondal et al., 2015; Taboada et al., 2011]. In this paper, we have used the WME2.0 lexicon to identify the concepts and their features in order to extract the sentiments of medical contexts.

¹ http://sentiwordnet.isti.cnr.it/
² http://sentic.net/
³ https://www.cs.uic.edu/liub/FBS/sentiment-analysis.html

[Figure 1: Two phase proposed Model]

3 Model Design

A knowledge-based sentiment lexicon is crucial for designing a context based sentiment extraction system. The medical concepts and their linguistic features are extracted from the domain-specific sentiment lexicon. To overcome the problem of experts' availability, we have formulated the WME2.0 lexicon with a hybrid approach. It adds an extra dimension for improving the accuracy of the extracted medical context sentiment. The proposed hybrid approach is the combination of linguistic and machine learning approaches. It consists of two phases, namely the pre-processing phase and the learning phase. Figure 1 shows the architecture of the proposed approach (model).

3.1 Pre-processing phase

... and knowledge-based information. The structured form of the concepts is essential in identifying the important medical concepts from the context.

[Figure 2: Flowchart of Preprocessing Phase]

To represent the structured medical concepts, the required steps are data extraction, cleansing and formatting. The research community provides various linguistic resources such as open source data pre-processing tools (viz. NLTK, stemming etc.) [Na et al., 2012]. The following steps illustrate the basic operations of the pre-processing phase:

Data Extraction: The extraction of medical concepts from a given context is the primary task of this step. WME2.0 helps to extract the medical concepts and their linguistic and sense-based features from the context. Moreover, the non-medical concepts and their senses are also essential for identifying the sentiment of the context. The senses of the non-medical concepts have been extracted using the SentiWordNet and SenticNet lexicons [Cambria et al., 2014; Cambria et al., 2013; Esuli and Sebastiani, 2006].

Data Cleansing: The data cleansing step is responsible for removing the stop-words of the context and stemming the concept words. The classification of medical and non-medical concepts and the identification of negation words (such as no, not, never etc.) are also handled by the data cleansing step [Huang and Lowe, 2007].

Data Formatting: Data formatting has been applied to represent the structured form of the extracted medical concepts [Hussain et al., 2011]. The extracted structured (vector) concepts are forwarded to the learning phase along with their features. The concept structure is represented as follows:

[Concept structure not recoverable]

3.2 Learning phase

Following the pre-processing phase, the hybrid approach is applied in the learning phase to build the contextual sentiment extraction system. Linguistic and machine learning approaches are combined to form the hybrid approach. The linguistic approach, together with the WME2.0 knowledge base lexicon, helps to identify the hidden rules. These rules are able to extract the concept sentiment and polarity. The extracted linguistic concept features (rules) were fed to the supervised machine learning classifiers to evaluate the accuracy of the model. The linguistic approach provides support for handling the negation effect of the context and helps to identify the appropriate sentiment of the context [Huang and Lowe, 2007]. The learning phase is illustrated as follows:

Step 1: Identify the polarity score and sense of each concept (medical and non-medical) of the context.
Step 2: Handle negation words (concepts) with the linguistic approach.
Step 3: Calculate the overall polarity of the context:

$$\text{Context polarity} = \sum_{i=1}^{c} \text{Polarity}_i$$

where $c$ is the number of concepts in the context and $\text{Polarity}_i$ is the polarity score of the $i$-th concept.
Step 4: Determine the context sentiment from the context polarity score.

4 Discussion and Evaluation

The medical concepts of a context and their semantic features (extraction polarity, semantics and sense) are required to identify the sentiment of the medical context [Sarker et al., 2011]. Medical sentiment lexicons based on statistical and linguistic features faced difficulties due to the unstructured nature of the corpus. Researchers therefore tried to build an intelligent automated sentiment extraction system in the Bio-NLP domain [Shukla et al., 2015; Sohn et al., 2012]. Such a system helps to extract structured knowledge-based information together with the proper sentiment of the context. WordNet of Medical Event (WME2.0) was introduced to identify medical concepts and their sense-based features. The WME2.0 lexicon is able to extract the medical concepts and their POS, semantics, gloss, affinity score, gravity score, polarity score and sense. On top of the WME2.0 lexicon, the hybrid approach has been applied to extract the context level sentiment for the proposed model.

The model is based on two phases, namely the pre-processing phase and the learning phase. The pre-processing phase covers concept extraction (medical and non-medical concepts), concept cleansing (concept stemming and stop-word removal) and concept formatting. The learning phase identifies the sentiment by applying the linguistic and machine learning approaches to the data produced by the pre-processing step. The linguistic features of the concepts and the knowledge-based WME sentiment resource help to extract the overall context sentiment and polarity score. The linguistic approach provides support for handling negation and identifies the correct sense of the context. For example, the medical context "No lung lesion found" is evaluated as "positive" after handling the negation. The system first extracts the concepts and their senses as "no (-ve)", "lung (neutral)", "lesion (-ve)" and "found (+ve)" using the WME2.0 resource. The linguistic negation handling approach is then applied to the extracted senses and identifies the overall context sense as "positive".

In the learning phase, the hybrid approach has been introduced to extract the context sentiment and measure its accuracy. The linguistic approach involves knowledge-based medical concept mapping with the WME2.0 lexicon. Further, the NaïveBayes and Sequential minimal optimization (SMO) support vector based supervised machine learning approaches have been employed for evaluating the accuracy of the model. Figure 3 and Figure 4 describe the positive and negative contexts with respect to the sentiment extraction process, respectively.

[Figure 3: Positive Sentiment extraction]

[Figure 4: Negative Sentiment extraction]

4.1 Evaluation Process

To develop the context level sentiment extraction system and measure its accuracy, the data was collected from an open source resource⁴. We extracted 7042 contexts for the proposed sentiment extraction system. The context sentiment extraction system labelled 3265 of the contexts as positive and 3777 as negative. To evaluate the extracted context sentiment, the linguistic features (number of negation words, context polarity score and sense) were fed to the NaïveBayes and support vector based SMO supervised machine learning classifiers in the WEKA⁵ tool. The 7042 extracted contexts were divided into 4900 training instances and the remaining 2142 test instances. The system's accuracy was measured as F-Measure under four test modes: Use training set, Supplied test set, Cross-validation Folds 10 and Percentage split %66. Table 1 shows the F-Measures of these modes for the NaïveBayes and support vector based SMO supervised classifiers. The linguistic and machine learning based hybrid approach provides an accuracy score of nearly 81% (an F-Measure of 0.815 on the supplied test set) for the medical context sentiment extraction model.

Table 1: F-Measure of Supervised classifiers

Model                       NaïveBayes   SMO
Use training set            0.868        0.890
Supplied test set           0.815        0.815
Cross-validation Folds 10   0.864        0.867
Percentage split %66        0.873        0.879

⁴ http://www.medicinenet.com/
⁵ http://weka.wikispaces.com/

5 Conclusion and Future scope

Sentiment or opinion analysis is important for extracting contextual information from medical contexts in the NLP domain. The context sentiment helps to identify the knowledge-based information and supports proper utilization of the context. The paper has reported a hybrid approach based context sentiment extraction model with two phases: pre-processing (extraction of important medical keywords) and learning (identification of the respective sentiment). In the process, the combined linguistic and machine learning hybrid approach has been applied on top of the WordNet of Medical Event (WME2.0) lexicon to extract the medical concepts and thereby identify the sentiment of the medical context. The polarity score of a medical concept and its related sense help to identify the medical context sentiment [Cambria, 2013] and [Cambria et al., 2015]. The affinity scores and semantic features of the medical concepts, driven by the WME2.0 lexicon, are crucial in building the proposed model. The semantics, polarity score and affinity score of a medical concept help to identify the concept sentiment along with its polarity score. The hybrid approach provides nearly 81% accuracy for the proposed context sentiment extraction system. Future research will focus on developing practical applications of the current work, such as medical annotation and context summarization systems. These systems will support expert and non-expert groups in their respective applications.

References

[Mondal et al., 2016] Anupam Mondal, Dipankar Das, Erik Cambria, and Sivaji Bandyopadhyay. WME: Sense, polarity and affinity based concept resource for medical events. In Proceedings of the Eighth Global WordNet Conference, pages 242–246, 2016.

[Birks et al., 2009] Yvonne Birks, Jean McKendree, and Ian Watt. Emotional intelligence and perceived stress in healthcare students: a multi-institutional, multi-professional survey. BMC Medical Education, 9(1):1–8, 2009.

[Boytcheva et al., 2005] Svetla Boytcheva, Albena Strupchanska, Elena Paskaleva, and Dimitar Tcharaktchiev. Some aspects of negation processing in electronic health records. In Proceedings of International Workshop Language and Speech Infrastructure for Information Access in the Balkan Countries, pages 1–8, 2005.

[Cambria et al., 2010] E. Cambria, A. Hussain, T. Durrani, C. Havasi, C. Eckl, and J. Munro. Sentic computing for patient centered applications. In IEEE 10th International Conference on Signal Processing Proceedings, pages 1279–1282, Oct 2010.

[Cambria, 2013] Erik Cambria. An introduction to concept-level sentiment analysis. In Advances in Soft Computing and Its Applications - 12th Mexican International Conference on Artificial Intelligence, MICAI 2013, Mexico City, Mexico, November 24-30, 2013, Proceedings, Part II, pages 478–483, 2013.

[Cambria, 2016] Erik Cambria. Affective computing and sentiment analysis. IEEE Intelligent Systems, 31(2):102–107, 2016.

[Cambria et al., 2015] Erik Cambria, Jie Fu, Federica Bisio, and Soujanya Poria. AffectiveSpace 2: Enabling affective intuition for concept-level sentiment analysis. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, pages 508–514, 2015.

[Cambria et al., 2014] Erik Cambria, Daniel Olsher, and Dheeraj Rajagopal. SenticNet 3: A common and common-sense knowledge base for cognition-driven sentiment analysis. In AAAI Conference on Artificial Intelligence, 2014.

[Cambria et al., 2013] Erik Cambria, Björn Schuller, Yunqing Xia, and Catherine Havasi. New avenues in opinion mining and sentiment analysis. IEEE Intelligent Systems, 28(2):15–21, 2013.

[Hussain et al., 2011] A. Hussain, E. Cambria, and C. Eckl. Bridging the gap between structured and unstructured healthcare data through semantics and sentics. In Proceedings of ACM WebSci, Koblenz, 2011.

[Elkin et al., 2005] Peter L. Elkin, Steven H. Brown, Brent A. Bauer, Casey S. Husser, William Carruth, Larry R. Bergstrom, and Dietlind L. Wahner-Roedler. A controlled trial of automated classification of negation from clinical notes. BMC Medical Informatics and Decision Making, 5(1):1–7, 2005.

[Esuli and Sebastiani, 2006] Andrea Esuli and Fabrizio Sebastiani. SentiWordNet: A publicly available lexical resource for opinion mining. In Proceedings of the 5th Conference on Language Resources and Evaluation (LREC06), pages 417–422, 2006.

[Fellbaum, 1998] Christiane Fellbaum. WordNet: an electronic lexical database. MIT Press, 1998.

[Huang and Lowe, 2007] Yang Huang and Henry J. Lowe. A novel hybrid approach to automated negation detection in clinical radiology reports. Journal of the American Medical Informatics Association: JAMIA, 14(3):304–311, May 2007.

[Mondal et al., 2015] Anupam Mondal, Iti Chaturvedi, Dipankar Das, Rajiv Bajpai, and Sivaji Bandyopadhyay. Lexical resource for medical events: A polarity based approach. In IEEE ICDM Workshops, pages 1302–1309. IEEE, 2015.

[Na et al., 2012] Jin-Cheon Na, Wai Yan Min Kyaing, Christopher SG Khoo, Schubert Foo, Yun-Ke Chang, and Yin-Leng Theng. Sentiment classification of drug reviews using a rule-based linguistic approach. In The outreach of digital libraries: a globalized resource network, pages 189–198. Springer, 2012.

[Niu et al., 2005] Yun Niu, Xiaodan Zhu, Jianhua Li, and Graeme Hirst. Analysis of polarity information in medical text. In Proceedings of the American Medical Informatics Association Annual Symposium, 2005.

[Sarker et al., 2011] Abeed Sarker, Diego Mollá-Aliod, Cécile Paris, et al. Outcome polarity identification of medical papers. Melbourne: Australian Language Technology Association, 2011.

[Shukla et al., 2015] Ravi Shankar Shukla, Kamendra Singh Yadav, Syed Tarif Abbas Rizvi, and Faisal Haseen. An Efficient Mining of Biomedical Data from Hypertext Documents via NLP. In Proceedings of the 3rd International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA) 2014: Volume 1, pages 651–658. Springer International Publishing, Cham, 2015.

[Smith and Fellbaum, 2004] Barry Smith and Christiane Fellbaum. Medical wordnet: A new methodology for the construction and validation of information resources for consumer health. In Proceedings of COLING, 2004.

[Smith and Lee, 2012] Phillip Smith and Mark Lee. Cross-discourse development of supervised sentiment analysis in the clinical domain. In Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis, WASSA '12, Association for Computational Linguistics, pages 79–83, Stroudsburg, PA, USA, 2012.

[Sohn et al., 2012] Sunghwan Sohn, Manabu Torii, Dingcheng Li, Stephen Wu, Hongfang Liu, and Avishwar Wagholikar. A Hybrid Approach to Sentiment Sentence Classification in Suicide Notes. In Biomedical Informatics Insights, pages 43+, January 2012.

[Swaminathan et al., 2010] Rajesh Swaminathan, Abhishek Sharma, and Hui Yang. Opinion mining for biomedical text data: Feature space design and feature selection. In The Ninth International Workshop on Data Mining in Bioinformatics, BIOKDD, 2010.

[Szarvas et al., 2008] György Szarvas, Veronika Vincze, Richárd Farkas, and János Csirik. The BioScope corpus: annotation for negation, uncertainty and their scope in biomedical texts. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing, Association for Computational Linguistics, pages 38–45, Columbus, Ohio, June 2008.

[Taboada et al., 2011] Maite Taboada, Milan Tofiloski, Julian Brooke, Kimberly Voll, and Manfred Stede. Lexicon-based methods for sentiment analysis. Computational Linguistics, 37(2):267–307, MIT Press, 2011.

[Villena-Román et al., 2011] Julio Villena-Román, Sonia Collada-Pérez, Sara Lana-Serrano, and José Carlos González Cristóbal. Hybrid approach combining machine learning and a rule-based expert system for text categorization. In R. Charles Murray and Philip M. McCarthy, editors, FLAIRS Conference. AAAI Press, 2011.
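
As an illustration of the pre-processing phase of Section 3.1, the data cleansing step (stop-word removal, stemming, negation word identification and the medical versus non-medical split) might be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the stop-word list, the `MEDICAL_TERMS` set (a stand-in for a WME2.0 lookup), the `Concept` record and the naive `stem` helper are all hypothetical, and a real system would more likely rely on a proper stemmer such as the ones NLTK ships.

```python
import re
from dataclasses import dataclass
from typing import List, Set

# All three sets are tiny illustrative assumptions, not data from the paper.
STOP_WORDS: Set[str] = {"the", "a", "an", "of", "is", "was", "in", "and"}
NEGATIONS: Set[str] = {"no", "not", "never"}
MEDICAL_TERMS: Set[str] = {"lung", "lesion", "fever"}  # stand-in for WME2.0


@dataclass
class Concept:
    text: str          # cleaned (lower-cased, stemmed) concept word
    is_medical: bool   # True if the concept is found in the medical lexicon
    is_negation: bool  # True for negation words such as "no" or "not"


def stem(word: str) -> str:
    """Very naive suffix stripping; a placeholder for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(context: str) -> List[Concept]:
    """Turn a raw context into a list of structured concepts."""
    concepts: List[Concept] = []
    for token in re.findall(r"[a-z]+", context.lower()):
        if token in NEGATIONS:
            # Negation words are kept: the learning phase needs them.
            concepts.append(Concept(token, False, True))
        elif token not in STOP_WORDS:
            stemmed = stem(token)
            concepts.append(Concept(stemmed, stemmed in MEDICAL_TERMS, False))
    return concepts
```

With these assumed word lists, `preprocess("No lesions found in the lung")` yields the concepts `no`, `lesion`, `found` and `lung`, of which `lesion` and `lung` are flagged as medical and `no` as a negation word.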
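
The four learning-phase steps of Section 3.2 (per-concept polarity lookup, negation handling, summation into a context polarity, and the final positive or negative decision) can be sketched in a few lines. The paper does not spell out its negation rule, so the sketch assumes one simple possibility: a negation word contributes no score itself and flips the sign of the next non-neutral concept. The polarity values in `POLARITY` are hypothetical placeholders rather than WME2.0 scores, and treating a polarity of exactly zero as positive is likewise an assumption.

```python
from typing import Dict, List

# Hypothetical polarity scores; real values would come from WME2.0,
# SentiWordNet and SenticNet.
POLARITY: Dict[str, float] = {"lung": 0.0, "lesion": -0.5, "found": 0.25}
NEGATIONS = {"no", "not", "never"}


def context_polarity(concepts: List[str]) -> float:
    """Sum the concept polarities (Step 3) after negation handling (Step 2)."""
    total = 0.0
    negate = False
    for concept in concepts:
        token = concept.lower()
        if token in NEGATIONS:
            negate = True  # assumed rule: the negation word itself scores 0
            continue
        score = POLARITY.get(token, 0.0)  # Step 1; unknown concepts are neutral
        if negate and score != 0.0:
            score = -score  # flip the next polar concept, then reset
            negate = False
        total += score
    return total


def context_sentiment(concepts: List[str]) -> str:
    """Step 4: map the context polarity to a sentiment label."""
    return "positive" if context_polarity(concepts) >= 0 else "negative"
```

Under these assumptions, the paper's example context "No lung lesion found" gives 0 + 0.5 + 0.25 = 0.75 and is labelled positive, while the same concepts without the negation word sum to -0.25 and are labelled negative.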
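
Table 1 reports F-Measures without stating how they are aggregated over the two classes. The sketch below shows one common reading, assumed here rather than confirmed by the paper: a per-class F-Measure (the harmonic mean of precision and recall) averaged with weights proportional to class frequency in the gold labels. The function names and the toy labels are illustrative only.

```python
from typing import List


def f_measure(gold: List[str], predicted: List[str], label: str) -> float:
    """F-Measure of one class: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correct
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f_measure(gold: List[str], predicted: List[str]) -> float:
    """Average the per-class F-Measures, weighted by gold class frequency."""
    labels = sorted(set(gold))
    weighted_sum = sum(f_measure(gold, predicted, lab) * gold.count(lab)
                       for lab in labels)
    return weighted_sum / len(gold)
```

For a toy case with gold labels positive, positive, negative, negative and predictions positive, negative, negative, negative, the positive class scores 2/3, the negative class scores 0.8, and the weighted average is about 0.733.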