=Paper=
{{Paper
|id=Vol-1650/smbm2016Skeppstedt
|storemode=property
|title=Marker Words for Negation and Speculation in Health Records and Consumer Reviews
|pdfUrl=https://ceur-ws.org/Vol-1650/smbm2016Skeppstedt.pdf
|volume=Vol-1650
|authors=Maria Skeppstedt,Carita Paradis,Andreas Kerren
|dblpUrl=https://dblp.org/rec/conf/smbm/SkeppstedtPK16
}}
==Marker Words for Negation and Speculation in Health Records and Consumer Reviews==
Maria Skeppstedt (1,2), Carita Paradis (3), Andreas Kerren (2)

(1) Gavagai AB, Stockholm, Sweden, maria@gavagai.se
(2) Computer Science Department, Linnaeus University, Växjö, Sweden, andreas.kerren@lnu.se
(3) Centre for Languages and Literature, Lund University, Lund, Sweden, carita.paradis@englund.lu.se

===Abstract===

Conditional random fields were trained to detect marker words for negation and speculation in two corpora belonging to two very different domains: clinical text and consumer review text. For the corpus of clinical text, marker words for speculation and negation were detected with results in line with previously reported inter-annotator agreement scores. This was also the case for speculation markers in the consumer review corpus, while detection of negation markers was unsuccessful in this genre. A setup in which models were trained on markers in consumer reviews and applied to the clinical text genre also yielded low results. This shows that neither the trained models, nor the choice of appropriate machine learning algorithms and features, were transferable across the two text genres.

===1 Introduction===

When health professionals document patient status, they often record common symptoms that the patient is not showing, or reason about possible diagnoses. Clinical texts, therefore, contain a large amount of negation and speculation (Velupillai et al., 2011).

Negations and speculations are also expressed in consumer review texts, e.g., when the reviewed artefact lacks an expected feature, or when reviewers are uncertain of their opinion. Previous research shows that the proportion of sentences containing negation and speculation is even larger in consumer review texts than in clinical texts (Vincze et al., 2008; Konstantinova et al., 2012).

The BioScope corpus was one of the first clinical corpora annotated for negation and speculation (Vincze et al., 2008). The guidelines used for the BioScope corpus have later, with only a few modifications, been used for annotating consumer review texts. A qualitative analysis of the differences between the medical genres of the BioScope corpus and consumer review texts has previously been carried out in order to adapt the guidelines to the genre of review texts (Konstantinova and de Sousa, 2011). To the best of our knowledge, there are, however, no previous studies in which the same machine learning algorithm is applied to both corpora and the results compared.

===2 Background===

There are other medical corpora annotated with the same guidelines as the BioScope corpus (Vincze et al., 2008), e.g., a drug-drug interaction corpus (Bokharaeian et al., 2014). There are also medical corpora annotated according to other guidelines, e.g., guidelines that include more fine-grained categories, such as weaker or stronger speculation/uncertainty (Velupillai, 2012), or whether a clinical finding is conditionally or hypothetically present in the patient (Uzuner et al., 2011). Large annotated corpora are often constructed on English medical text, e.g., the i2b2/VA challenge on concepts, assertions, and relations corpus, but negation and speculation have also been annotated in corpora with clinical text written in, e.g., Swedish (Velupillai, 2012) and Japanese (Aramaki et al., 2014).

Examples of non-medical corpora are the previously mentioned corpus of consumer reviews (Konstantinova and de Sousa, 2011), and literary texts annotated for negation in the *SEM shared task (Morante and Blanco, 2012).

Negations and speculations are often annotated in two steps. First, marker words (often also referred to as cue words or keywords) for negation/speculation are annotated; then, either the scope of text that the marker words affect is annotated, or it is annotated whether specific focus words occurring in the text are affected by the marker words. Focus words could, for instance, be clinical findings that are mentioned in the same sentence as the marker words. Automatic detection of negation and speculation is typically divided into two subtasks corresponding to the two annotation steps. That is, first the marker words are detected and, thereafter, the task of determining the scope or classifying the focus words is carried out.

In this study, the first of the two subtasks of negation/speculation detection is addressed, i.e., the detection of marker words for negation and speculation. This task is typically addressed using one of two main approaches: either a vocabulary of negation/speculation markers is compiled and tokens in the text are compared to this vocabulary in order to determine whether they are marker words (Chapman et al., 2001; Ahltorp et al., 2014), or alternatively a machine learning model is trained.
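To make the vocabulary-based approach concrete, the following minimal sketch matches tokens against a small marker list. The vocabulary entries and the find_markers function are invented for this illustration and do not reproduce the actual systems of Chapman et al. (2001) or Ahltorp et al. (2014):

```python
# Minimal sketch of the vocabulary-matching approach to marker detection.
# The marker list below is an illustrative example, not a curated resource.
SPECULATION_MARKERS = {
    ("may",), ("possible",), ("probable",),
    ("raises", "the", "question", "of"),   # a complex multi-word marker
}

def find_markers(tokens, vocabulary):
    """Return (start, end) token spans that match a vocabulary entry.

    Longer entries are tried first, so a multi-word marker takes
    precedence over a single-word marker starting at the same token."""
    entries = sorted(vocabulary, key=len, reverse=True)
    spans = []
    for start in range(len(tokens)):
        for entry in entries:
            candidate = tuple(t.lower() for t in tokens[start:start + len(entry)])
            if candidate == entry:
                spans.append((start, start + len(entry)))
                break
    return spans

sentence = "This raises the question of a possible contamination".split()
print(find_markers(sentence, SPECULATION_MARKERS))
# [(1, 5), (6, 7)]  ->  "raises the question of" and "possible"
```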
===3 Materials===

Two English corpora were used in the experiments: the BioScope corpus (Vincze et al., 2008) and the SFU Review corpus annotated for negation and speculation (Konstantinova et al., 2012). As previously mentioned, the annotation guidelines for the SFU Review corpus were an adaptation of the guidelines for the BioScope corpus, and the two sets of guidelines were, therefore, very similar. In both corpora, marker words expressing negation and speculation were annotated, as well as their scope. The general principle for the length of text to annotate as a marker word was to annotate the minimal unit of text that still expresses negation or speculation. The definition of negation used for the task was "[...] the implication of the non-existence of something", while speculation was defined as "[...] the possible existence of a thing, i.e. neither its existence nor its non-existence is unequivocally stated [...]". Marker words could either be individual words that express negation or speculation on their own, e.g., "This {may} {indicate}...", or complex expressions containing several words that do not convey negation or speculation on their own, e.g., "This {raises the question of}...".

The BioScope corpus consists of three sub-corpora, containing clinical text, biological full papers and biological scientific abstracts. For this study, the sub-corpus containing clinical text was used, which consists of 6,400 sentences, of which 14% contain negation and 13% contain speculation. The pairwise agreement rates for the three annotators involved in the project were 91/95/96 for annotating marker words for negation and 84/90/92 for marker words for speculation.

The corpus of consumer reviews was a previously compiled corpus, the SFU Review corpus, to which annotations of negation and speculation were added. The corpus contains consumer-generated reviews of books, movies, music, cars, computers, cookware and hotels (Taboada and Grieve, 2004; Taboada et al., 2006). The corpus consists of 17,000 sentences, of which 18% were annotated as containing negation and 22% as containing speculation. 10% of the corpus was doubly annotated to measure inter-annotator agreement, resulting in an F-score and a Kappa score of 92 for negation markers and 89 for speculation markers.

There are previous studies on the detection of speculation and negation markers in these two corpora. A perfect precision and a recall of 0.98 were obtained when training an IGTree classifier to detect negation markers on the full paper sub-corpus of the BioScope corpus and evaluating it on the clinical sub-corpus (Morante and Daelemans, 2009b). Similar results for detecting negation markers in the clinical sub-corpus were achieved by a vocabulary matching system. When using the same set-up for detecting speculation markers, i.e., training on the paper sub-corpus and evaluating on the clinical one, a precision of 0.88 and a recall of 0.27 were achieved (Morante and Daelemans, 2009a). For these experiments, the token to be classified, as well as its immediate neighbouring tokens, were used as features. When instead training as well as evaluating on the clinical sub-corpus (a conditional random fields model with tokens as features), a precision of 0.99 and a recall of 0.87 were achieved for detecting speculation, while a rule-based vocabulary matching system achieved a precision of 0.95 and a recall of 0.96 on this task (Agarwal and Yu, 2010). Examples of other reported results are a precision/recall of 0.97/0.98 for negation markers and 0.96/0.93 for speculation markers (Cruz Díaz et al., 2012), using a C4.5 classifier and a support vector machine.

There is also previous research on the detection of which tokens constitute negation and speculation markers in the SFU Review corpus (Cruz et al., 2015). Experiments were conducted in which 10-fold cross-validation was applied on the entire corpus, and a feature set that included the token and its closest neighbours was used. For the most successful machine learning algorithm (a cost-sensitive support vector machine), a precision of 0.80 and a recall of 0.98 were obtained for negation, and a precision of 0.91 and a recall of 0.94 were obtained for speculation. For the two other evaluated algorithms (Naive Bayes and a support vector machine with a radial basis function kernel), much lower and slightly lower results, respectively, were obtained. Both of these lower-performing models had problems handling multi-word markers for negation that included n't or not, and results for these two models were improved by a simple rule-based post-processing algorithm specifically designed to handle these cases.
===4 Experiments===

The experiments consisted of training machine learning models to recognise markers for negation and speculation and, thereafter, evaluating these models. Three setups were used: i) models trained on a subset of the BioScope corpus and evaluated on another subset of the same corpus, ii) models trained on a subset of the SFU Review corpus and evaluated on another subset of this corpus, and finally iii) models trained on the SFU Review corpus and evaluated on the BioScope corpus. The rationale for performing the last experiment was the difficulty often associated with getting access to large amounts of clinical text, due to the sensitive content of text belonging to this genre. If it were possible to successfully apply a model trained on non-clinical text to the clinical text genre, this might be a possible solution in cases when the amount of available clinical data is scarce.

The text segments annotated as negation and speculation markers were coded according to the BIO format, i.e., a token could be the beginning of, inside of, or outside of a marker segment.
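A minimal sketch of this coding, assuming marker annotations given as token spans (the function name and example sentence are invented for illustration):

```python
# Minimal sketch of BIO coding: every token is labelled as the Beginning of,
# Inside of, or Outside of an annotated marker segment.
def to_bio(tokens, marker_spans):
    """marker_spans holds (start, end) token offsets of annotated markers."""
    tags = ["O"] * len(tokens)
    for start, end in marker_spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

tokens = "This raises the question of contamination".split()
print(list(zip(tokens, to_bio(tokens, [(1, 5)]))))
# [('This', 'O'), ('raises', 'B'), ('the', 'I'), ('question', 'I'),
#  ('of', 'I'), ('contamination', 'O')]
```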
The approach of structured prediction was taken, and the PyStruct package (Müller and Behnke, 2014) was used to train a linear conditional random fields model, using the OneSlackSSVM class. Default parameters were used (which included a regularisation parameter of 1) and a maximum of 100 passes over the dataset to find constraints. To limit the feature set, as the models were to be trained on a limited amount of data, features were restricted to the token that was to be classified, and, in addition, a minimum of two occurrences of a token in the training data was required for it to be included. As linear conditional random fields were used, the classification of a token was dependent on the classification of the two neighbouring tokens (Sutton and McCallum, 2006), making it possible to detect multi-word markers.
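The sketch below illustrates this kind of set-up with the PyStruct API: one-hot token-identity features restricted to tokens occurring at least twice in the training data, and a linear-chain CRF trained with the OneSlackSSVM learner (C=1, at most 100 passes). The toy sentences and labels are invented, and this is a minimal sketch of the general approach rather than the authors' exact pipeline:

```python
import numpy as np
from collections import Counter
from pystruct.models import ChainCRF
from pystruct.learners import OneSlackSSVM

def build_vocab(train_sentences, min_count=2):
    """Index tokens occurring at least min_count times in the training data."""
    counts = Counter(tok.lower() for sent in train_sentences for tok in sent)
    kept = sorted(tok for tok, c in counts.items() if c >= min_count)
    return {tok: i for i, tok in enumerate(kept)}

def encode(sentences, vocab):
    """One-hot token-identity features; rare/unseen tokens share an OOV slot."""
    n_feats = len(vocab) + 1
    X = []
    for sent in sentences:
        x = np.zeros((len(sent), n_feats))
        for i, tok in enumerate(sent):
            x[i, vocab.get(tok.lower(), n_feats - 1)] = 1.0
        X.append(x)
    return X

# toy training data with BIO labels (0 = Outside, 1 = Beginning, 2 = Inside)
train_sents = [
    ["this", "raises", "the", "question", "of", "contamination"],
    ["this", "raises", "the", "question", "of", "infection"],
    ["no", "fracture", "seen"],
    ["no", "effusion", "seen"],
]
train_labels = [
    np.array([0, 1, 2, 2, 2, 0]),
    np.array([0, 1, 2, 2, 2, 0]),
    np.array([1, 0, 0]),
    np.array([1, 0, 0]),
]

vocab = build_vocab(train_sents)
X_train = encode(train_sents, vocab)

# linear-chain CRF trained with a one-slack structural SVM; neighbouring
# labels interact through the chain, so multi-token markers can be learnt
ssvm = OneSlackSSVM(model=ChainCRF(), C=1.0, max_iter=100)
ssvm.fit(X_train, train_labels)
print(ssvm.predict(encode([["this", "raises", "the", "question", "of", "pain"]], vocab)))
```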
n’t or not were, however, very common among ining incorrectly classified segments showed that false negatives and positives, and it is therefore false negatives were not limited to marker words likely that the low results for this category were that might be more typical to the reasoning style due to the inability of the trained model to detect of the clinical genre, e.g., evaluate, suggest, indi- multi-token negations, i.e., the same problem that cate, compatible, consistent and question, but also arose for two of the models trained by Cruz et included general expressions such as possible and al. (2015). This might, for instance, be an effect probable. of not including the neighbouring words as fea- Results also show that not even lessons learnt tures. The models were, however, in general able for the choice of appropriate machine learning al- to detect multi-word marker words, e.g., the fol- gorithms and features are transferable across gen- lowing complex speculation markers I-’d-suggest, res, as the techniques for detecting negation that would-think, can-either, might-expect, would-feel. was shown successful for the BioScope corpus There were also a number of complex expres- produced low results on the SFU Review corpus. sions among the false positives for speculation, Future work includes research on whether these that might be considered as belonging to this class, findings also hold for the scope of the markers. despite not being annotated as such. Examples are can-hope, can-either, to-think. 6 Conclusion Also the setting of training the model on the SFU Review corpus and evaluating it on the Bio- In the BioScope corpus, speculation and negation Scope corpus gave low results for negation as well markers were detected with results close to previ- as for speculation. It can, however, be observed ously reported annotator agreement scores. This that for speculation markers, this strategy was was also the case for speculation markers in the more successful than the previously explored strat- SFU Review corpus, while detection of negation egy of training a model on biomedical article texts markers was unsuccessful in this genre. To train and applying it on the clinical text genre (Morante the model on consumer reviews and apply it on and Daelemans, 2009a). There might thus be a clinical text also yielded low results, showing that larger similarity between how speculation is ex- neither the trained models, nor the choice of ap- pressed in consumer reviews and in clinical texts, propriate algorithms and features, were transfer- than between clinical and biomedical texts. Exam- able across the two text genres. Acknowledgements Research Workshop associated with The 8th Inter- national Conference on Recent Advances in Natural This work was funded by the StaViCTA project, Language Processing, RANLP 2011, 13 September, framework grant “the Digitized Society – Past, 2011, Hissar, Bulgaria, pages 139–144. Present, and Future” with No. 2012-5659 from the Natalia Konstantinova, Sheila C.M. de Sousa, Noa P. Swedish Research Council (Vetenskapsrådet). Cruz, Manuel J. Maña, Maite Taboada, and Rus- lan Mitkov. 2012. A review corpus annotated for negation, speculation and their scope. In Nico- References letta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğanur, Bente Maegaard, Joseph Shashank Agarwal and Hong Yu. 2010. 
===5 Results and discussion===

For detecting speculation markers in the SFU Review corpus, and for detecting both speculation and negation markers in the BioScope corpus when trained on text of the same genre, the method was relatively successful (Figure 1), achieving results in line with the inter-annotator agreement.[1] For detecting negation, the increase in training data size did not affect these results, while the general trend for speculation was an improvement of results with more training samples, although results remained slightly unstable.

[Figure 1: Average results for different numbers of training samples. Panels show precision, recall and F-score (0.5–1) against the number of training samples (1,000–3,000), for negation (top row) and speculation (bottom row), for the BioScope, SFU and SFU/BioScope settings. SFU/BioScope is the model trained on the SFU Review corpus and applied on the BioScope corpus.]

For detecting negation in the SFU Review corpus, on the other hand, results were much lower than the measured agreement figures. Results were consistently low for all four folds (F-scores of 0.70/0.75/0.76/0.74 for 3,000 training instances), and the F-score decreased with a larger training data set, due to a decrease in precision and a recall that remained low. It could be ruled out that the low results were due to the relatively small training data size, since an additional model, trained on 8,000 samples, gave even lower results (an F-score of 0.62). Multi-token negation markers including n't or not were, however, very common among the false negatives and false positives, and it is therefore likely that the low results for this category were due to the inability of the trained model to detect multi-token negations, i.e., the same problem that arose for two of the models trained by Cruz et al. (2015). This might, for instance, be an effect of not including the neighbouring words as features. The models were, however, in general able to detect multi-word marker words, e.g., the following complex speculation markers: I-'d-suggest, would-think, can-either, might-expect, would-feel. There were also a number of complex expressions among the false positives for speculation that might be considered as belonging to this class, despite not being annotated as such. Examples are can-hope, can-either, to-think.

Also the setting of training the model on the SFU Review corpus and evaluating it on the BioScope corpus gave low results, for negation as well as for speculation. It can, however, be observed that for speculation markers, this strategy was more successful than the previously explored strategy of training a model on biomedical article texts and applying it on the clinical text genre (Morante and Daelemans, 2009a). There might thus be a larger similarity between how speculation is expressed in consumer reviews and in clinical texts than between clinical and biomedical texts. Examining incorrectly classified segments showed that false negatives were not limited to marker words that might be more typical of the reasoning style of the clinical genre, e.g., evaluate, suggest, indicate, compatible, consistent and question, but also included general expressions such as possible and probable.

Results also show that not even lessons learnt for the choice of appropriate machine learning algorithms and features are transferable across genres, as the techniques for detecting negation that were shown to be successful for the BioScope corpus produced low results on the SFU Review corpus. Future work includes research on whether these findings also hold for the scope of the markers.

[1] Previous machine learning results have typically been achieved using a larger training set, and, therefore, a comparison to the agreement figures was carried out, instead of a comparison to previous results.

===6 Conclusion===

In the BioScope corpus, speculation and negation markers were detected with results close to previously reported annotator agreement scores. This was also the case for speculation markers in the SFU Review corpus, while detection of negation markers was unsuccessful in this genre. Training the model on consumer reviews and applying it on clinical text also yielded low results, showing that neither the trained models, nor the choice of appropriate algorithms and features, were transferable across the two text genres.

===Acknowledgements===

This work was funded by the StaViCTA project, framework grant "the Digitized Society – Past, Present, and Future" with No. 2012-5659 from the Swedish Research Council (Vetenskapsrådet).
===References===

Shashank Agarwal and Hong Yu. 2010. Detecting hedge cues and their scope in biomedical text with conditional random fields. Journal of Biomedical Informatics, 43(6):953–961.

Magnus Ahltorp, Hideyuki Tanushi, Shiho Kitajima, Maria Skeppstedt, Rafal Rzepka, and Kenji Araki. 2014. HokuMed in NTCIR-11 MedNLP-2: Automatic extraction of medical complaints from Japanese health records using machine learning and rule-based methods. In Proceedings of NTCIR-11, pages 158–162.

Eiji Aramaki, Mizuki Morita, Yoshinobu Kano, and Tomoko Ohkuma. 2014. Overview of the NTCIR-11 MedNLP-2 Task. In Proceedings of NTCIR-11, pages 147–154.

Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, Stroudsburg, PA, USA. Association for Computational Linguistics.

Behrouz Bokharaeian, Alberto Diaz, Mariana Neves, and Virginia Francisco. 2014. Exploring negation annotations in the DrugDDI corpus. In Fourth Workshop on Building and Evaluating Resources for Health and Biomedical Text Processing (BIOTxtM 2014).

Wendy W. Chapman, Will Bridewell, Paul Hanbury, Gregory F. Cooper, and Bruce G. Buchanan. 2001. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics, 34(5):301–310, Oct.

Noa P. Cruz, Maite Taboada, and Ruslan Mitkov. 2015. A machine-learning approach to negation and speculation detection for sentiment analysis. Journal of the Association for Information Science and Technology, pages 526–558.

Noa P. Cruz Díaz, Manuel J. Maña López, Jacinto Mata Vázquez, and Victoria Pachón Álvarez. 2012. A machine-learning approach to negation and speculation detection in clinical texts. Journal of the American Society for Information Science and Technology, 63(7):1398–1410.

Natalia Konstantinova and Sheila C. M. de Sousa. 2011. Annotating negation and speculation: the case of the review domain. In Proceedings of the Student Research Workshop associated with the 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011), 13 September, 2011, Hissar, Bulgaria, pages 139–144.

Natalia Konstantinova, Sheila C. M. de Sousa, Noa P. Cruz, Manuel J. Maña, Maite Taboada, and Ruslan Mitkov. 2012. A review corpus annotated for negation, speculation and their scope. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC), pages 3190–3195, Istanbul, Turkey. European Language Resources Association (ELRA).

Roser Morante and Eduardo Blanco. 2012. *SEM 2012 shared task: resolving the scope and focus of negation. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, SemEval '12, pages 265–274.

Roser Morante and Walter Daelemans. 2009a. Learning the scope of hedge cues in biomedical texts. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing, BioNLP '09, pages 28–36, Stroudsburg, PA, USA. Association for Computational Linguistics.

Roser Morante and Walter Daelemans. 2009b. A metalearning approach to processing the scope of negation. In CoNLL '09: Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 21–29, Morristown, NJ, USA. Association for Computational Linguistics.

Andreas C. Müller and Sven Behnke. 2014. PyStruct - learning structured prediction in Python. Journal of Machine Learning Research, 15:2055–2060.

Charles Sutton and Andrew McCallum. 2006. An introduction to conditional random fields for relational learning. In Lise Getoor and Ben Taskar, editors, Introduction to Statistical Relational Learning. MIT Press.

Maite Taboada and Jack Grieve. 2004. Analyzing appraisal automatically. In Proceedings of the AAAI Spring Symposium on Exploring Attitude and Affect in Text: Theories and Applications, pages 158–161.

Maite Taboada, Caroline Anthony, and Kimberly Voll. 2006. Methods for creating semantic orientation dictionaries. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pages 427–432, Genoa, Italy. European Language Resources Association (ELRA).

Özlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556.

Sumithra Velupillai, Hercules Dalianis, and Maria Kvist. 2011. Factuality Levels of Diagnoses in Swedish Clinical Text. In A. Moen, S. K. Andersen, J. Aarts, and P. Hurlen, editors, Proceedings of the XXIII International Conference of the European Federation for Medical Informatics (User Centred Networked Health Care), pages 559–563, Oslo, August. IOS Press.

Sumithra Velupillai. 2012. Shades of Certainty – Annotation and Classification of Swedish Medical Records. Doctoral thesis, Department of Computer and Systems Sciences, Stockholm University, Stockholm, Sweden, April.

Veronika Vincze, György Szarvas, Richárd Farkas, György Móra, and János Csirik. 2008. The BioScope Corpus: Biomedical texts annotated for uncertainty, negation and their scopes. BMC Bioinformatics, 9 (Suppl 11):S9.