Marker words for negation and speculation in health records and consumer reviews

Maria Skeppstedt (1,2), Carita Paradis (3), Andreas Kerren (2)

(1) Gavagai AB, Stockholm, Sweden, maria@gavagai.se
(2) Computer Science Department, Linnaeus University, Växjö, Sweden, andreas.kerren@lnu.se
(3) Centre for Languages and Literature, Lund University, Lund, Sweden, carita.paradis@englund.lu.se

Abstract

Conditional random fields were trained to detect marker words for negation and speculation in two corpora belonging to two very different domains: clinical text and consumer review text. For the corpus of clinical text, marker words for speculation and negation were detected with results in line with previously reported inter-annotator agreement scores. This was also the case for speculation markers in the consumer review corpus, while detection of negation markers was unsuccessful in this genre. A setup in which models were trained on markers in consumer reviews and applied on the clinical text genre also yielded low results. This shows that neither the trained models, nor the choice of appropriate machine learning algorithms and features, were transferable across the two text genres.

1   Introduction

When health professionals document patient status, they often record common symptoms that the patient is not showing, or reason about possible diagnoses. Clinical texts, therefore, contain a large amount of negation and speculation (Velupillai et al., 2011).

Negations and speculations are also expressed in consumer review texts, e.g., when the reviewed artefact lacks an expected feature, or when reviewers are uncertain of their opinion. Previous research shows that the proportion of sentences containing negation and speculation is even larger in consumer review texts than in clinical texts (Vincze et al., 2008; Konstantinova et al., 2012).

The BioScope corpus was one of the first clinical corpora annotated for negation and speculation (Vincze et al., 2008). The guidelines used for the BioScope corpus have later, with only a few modifications, been used for annotating consumer review texts. A qualitative analysis of the differences between the medical genres of the BioScope corpus and consumer review texts has previously been carried out in order to adapt the guidelines to the genre of review texts (Konstantinova and de Sousa, 2011). To the best of our knowledge, there are, however, no previous studies in which the same machine learning algorithm is applied to both corpora and the results are compared.

2   Background

There are other medical corpora annotated with the same guidelines as the BioScope corpus (Vincze et al., 2008), e.g., a drug-drug interaction corpus (Bokharaeian et al., 2014). There are also medical corpora annotated according to other guidelines, e.g., guidelines that include more fine-grained categories, such as weaker or stronger speculation/uncertainty (Velupillai, 2012), or whether a clinical finding is conditionally or hypothetically present in the patient (Uzuner et al., 2011). Large annotated corpora are most often constructed for English medical text, e.g., the i2b2/VA challenge on concepts, assertions, and relations corpus, but negation and speculation have also been annotated in corpora of clinical text written in, e.g., Swedish (Velupillai, 2012) and Japanese (Aramaki et al., 2014).

Examples of non-medical corpora are the previously mentioned corpus of consumer reviews (Konstantinova and de Sousa, 2011), and literary texts annotated for negation in the *SEM shared task (Morante and Blanco, 2012).

Negations and speculations are often annotated in two steps. First, marker words (often also referred to as cue words or keywords) for negation/speculation are annotated; then either the scope of text that the marker words affect is annotated, or it is annotated whether specific focus words occurring in the text are affected by the marker words. Focus words could, for instance, be clinical findings that are mentioned in the same sentence as the marker words. Automatic detection of negation and speculation is typically divided into two subtasks corresponding to the two annotation steps. That is, first the marker words are detected and, thereafter, the task of determining the scope or classifying the focus words is carried out.

In this study, the first of the two subtasks of negation/speculation detection is addressed, i.e., the detection of marker words for negation and speculation. This task is typically addressed using one of two main approaches: either a vocabulary of negation/speculation markers is compiled and tokens in the text are compared to this vocabulary in order to determine whether they are marker words (Chapman et al., 2001; Ahltorp et al., 2014), or a machine learning model is trained.
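As a minimal illustration of the first of these approaches, the sketch below matches lower-cased token sequences against a small compiled vocabulary of negation markers. It is a sketch only: the marker list and the function name are illustrative, and do not reproduce the vocabularies used by the cited systems.

    # Toy marker vocabulary; multi-word markers are stored as token tuples.
    NEGATION_MARKERS = {("no",), ("not",), ("without",), ("rather", "than")}

    def find_negation_markers(tokens):
        """Return (start, end) token spans that match the marker vocabulary."""
        lowered = [token.lower() for token in tokens]
        spans = []
        for i in range(len(lowered)):
            for marker in NEGATION_MARKERS:
                if tuple(lowered[i:i + len(marker)]) == marker:
                    spans.append((i, i + len(marker)))
        return spans

    print(find_negation_markers("The patient shows no sign of fever".split()))
    # [(3, 4)]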
3   Materials

Two English corpora were used in the experiments: the BioScope corpus (Vincze et al., 2008) and the SFU Review corpus annotated for negation and speculation (Konstantinova et al., 2012).

As previously mentioned, the annotation guidelines for the SFU Review corpus were an adaptation of the guidelines for the BioScope corpus, and they were, therefore, very similar. In both corpora, marker words expressing negation and speculation were annotated, as well as their scope. The general principle for the length of text to annotate as marker words was to annotate the minimal unit of text that still expresses negation or speculation. The definition of negation used for the task was “[...] the implication of the non-existence of something”, while speculation was defined as “[...] the possible existence of a thing, i.e. neither its existence nor its non-existence is unequivocally stated [...]”. Marker words could either be individual words that express negation or speculation on their own, e.g., “This {may} {indicate}..”, or complex expressions containing several words that do not convey negation or speculation on their own, e.g., “This {raises the question of}...”.

The BioScope corpus consists of three subcorpora, containing clinical text, biological full papers and biological scientific abstracts. For this study, the subcorpus containing clinical text was used, which consists of 6,400 sentences, of which 14% contain negation and 13% contain speculation. The pairwise agreement rates for the three annotators involved in the project were 91/95/96 for annotating marker words for negation and 84/90/92 for marker words for speculation.

The corpus of consumer reviews was a previously compiled corpus, the SFU Review corpus, to which annotations of negation and speculation were added. The corpus contains consumer-generated reviews of books, movies, music, cars, computers, cookware and hotels (Taboada and Grieve, 2004; Taboada et al., 2006). The corpus consists of 17,000 sentences, of which 18% were annotated as containing negation and 22% as containing speculation. 10% of the corpus was doubly annotated to measure inter-annotator agreement, resulting in an F-score and Kappa score of 92 for negation markers and 89 for speculation markers.

There are previous studies on the detection of speculation and negation markers in these two corpora. A perfect precision and a recall of 0.98 were obtained when training an IGTree classifier to detect negation markers on the full paper subcorpus of the BioScope corpus and evaluating it on the clinical subcorpus (Morante and Daelemans, 2009b). Similar results for detecting negation markers in the clinical subcorpus were achieved by a vocabulary matching system. When using the same set-up for detecting speculation markers, i.e., training on the paper subcorpus and evaluating on the clinical, a precision of 0.88 and a recall of 0.27 were achieved (Morante and Daelemans, 2009a). For these experiments, the token to be classified, as well as its immediate neighbouring tokens, were used as features. When instead training as well as evaluating on the clinical subcorpus (a conditional random fields model with tokens as features), a precision of 0.99 and a recall of 0.87 were achieved for detecting speculation, while a rule-based vocabulary matching system achieved a precision of 0.95 and a recall of 0.96 on this task (Agarwal and Yu, 2010). Examples of other reported results are a precision/recall of 0.97/0.98 for negation markers and 0.96/0.93 for speculation markers (Cruz Díaz et al., 2012), using a C4.5 classifier and a support vector machine.

There is also previous research on the detection of which tokens constitute negation and speculation markers in the SFU Review corpus (Cruz et al., 2015). Experiments were conducted in which 10-fold cross-validation was applied on the entire corpus, and a feature set that included the token and its closest neighbours was used. For the most successful machine learning algorithm (a cost-sensitive support vector machine), a precision of 0.80 and a recall of 0.98 were obtained for negation, and a precision of 0.91 and a recall of 0.94 were obtained for speculation. For the two other evaluated algorithms (Naive Bayes and a support vector machine with a radial basis function kernel), much lower and slightly lower results, respectively, were obtained. Both of these lower-performing models had problems handling multi-word markers for negation that included n't or not, and results for these two models were improved by a simple rule-based post-processing algorithm specifically designed to handle these cases.

4   Experiments

The experiments consisted of training machine learning models to recognise markers for negation and speculation and, thereafter, evaluating these models. Three setups were used: i) models trained on a subset of the BioScope corpus and evaluated on another subset of the same corpus, ii) models trained on a subset of the SFU Review corpus and evaluated on another subset of this corpus, and finally iii) models trained on the SFU Review corpus and evaluated on the BioScope corpus. The rationale for performing the last experiment was the difficulty often associated with getting access to large amounts of clinical text, due to the sensitive content of text belonging to this genre. If it were possible to successfully apply a model trained on non-clinical text to the clinical text genre, this might be a solution in cases where the amount of available clinical data is scarce.

The text segments annotated as negation and speculation markers were coded according to the BIO format, i.e., a token could be the beginning of, inside of, or outside of a marker segment. The approach of structured prediction was taken, and the PyStruct package (Müller and Behnke, 2014) was used to train a linear conditional random fields model, using the OneSlackSSVM class. Default parameters were used (which included a regularisation parameter of 1), with a maximum of 100 passes over the dataset to find constraints. To limit the feature set, as the models were to be trained on a limited amount of data, features were restricted to the token that was to be classified, and, in addition, a minimum of two occurrences of a token in the training data was required for it to be included. As linear conditional random fields were used, the classification of a token was dependent on the classification of the two neighbouring tokens (Sutton and McCallum, 2006), making it possible to detect multi-word markers.
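The following is a minimal sketch of this set-up, assuming PyStruct and NumPy are installed. The toy sentences, the BIO label set, and the handling of rare tokens (a shared out-of-vocabulary slot) are illustrative assumptions, not the exact implementation used in the study.

    # A minimal sketch: BIO-coded marker segments, one-hot features for the
    # token to be classified (restricted to tokens occurring at least twice
    # in the training data), and a linear-chain CRF trained with PyStruct's
    # OneSlackSSVM learner.
    from collections import Counter

    import numpy as np
    from pystruct.learners import OneSlackSSVM
    from pystruct.models import ChainCRF

    LABELS = {"O": 0, "B": 1, "I": 2}  # outside/beginning/inside a marker

    # Toy training data (illustrative only).
    train_sentences = [["this", "raises", "the", "question", "of", "pneumonia"],
                       ["no", "sign", "of", "fever"],
                       ["no", "signs", "of", "pneumonia"]]
    train_tags = [["O", "B", "I", "I", "I", "O"],
                  ["B", "O", "O", "O"],
                  ["B", "O", "O", "O"]]

    # Keep only tokens occurring at least twice; rarer and unseen tokens
    # share one out-of-vocabulary slot (one possible way of handling them).
    counts = Counter(tok for sent in train_sentences for tok in sent)
    vocab = {tok: i for i, tok in
             enumerate(t for t, c in counts.items() if c >= 2)}
    oov = len(vocab)

    def featurise(sentence):
        # One feature vector per token: a one-hot encoding of the token.
        X = np.zeros((len(sentence), len(vocab) + 1))
        for i, tok in enumerate(sentence):
            X[i, vocab.get(tok, oov)] = 1.0
        return X

    X_train = [featurise(s) for s in train_sentences]
    y_train = [np.array([LABELS[t] for t in tags]) for tags in train_tags]

    # In a linear-chain CRF, a token's label depends on the neighbouring
    # labels, which is what makes multi-word markers detectable.
    learner = OneSlackSSVM(model=ChainCRF(), C=1.0, max_iter=100)
    learner.fit(X_train, y_train)
    print(learner.predict([featurise(["no", "signs", "of", "fever"])]))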
For all setups, the models were trained with increasingly larger amounts of training data, from 600 to 3,000 training instances. In each iteration, 200 new training instances were randomly selected for inclusion in the training data. The same experiment was repeated four times, each time with a new, randomly selected subset of held-out data for evaluation in setups i) and ii), and (for all experiments) with new random selections of training instances. Precision, recall and F-score for recognising segments that were classified as negation or speculation markers were measured with NLTK's ChunkScore class (Bird, 2002).
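A minimal sketch of such an evaluation with NLTK's ChunkScore is given below; the toy sentence, the gold-standard and predicted tags, and the B-NEG/I-NEG tag names are illustrative placeholders.

    # A minimal sketch of segment-level evaluation with NLTK's ChunkScore.
    from nltk.chunk.util import ChunkScore, conlltags2tree

    def to_tree(tokens, bio_tags):
        # ChunkScore compares chunk trees; conlltags2tree expects
        # (word, pos, iob) triples, so a dummy POS tag is inserted.
        return conlltags2tree([(tok, "X", tag)
                               for tok, tag in zip(tokens, bio_tags)])

    # Toy gold and predicted BIO tags for one sentence (illustrative).
    evaluation_data = [(["it", "can", "not", "fail"],
                        ["O", "O", "B-NEG", "O"],       # gold standard
                        ["O", "B-NEG", "I-NEG", "O"])]  # model output

    chunkscore = ChunkScore()
    for tokens, gold_tags, predicted_tags in evaluation_data:
        chunkscore.score(to_tree(tokens, gold_tags),
                         to_tree(tokens, predicted_tags))
    print(chunkscore.precision(), chunkscore.recall(), chunkscore.f_measure())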
[Figure 1 consists of two rows of line plots (top row: negation; bottom row: speculation), showing precision, recall and F-score (y-axes, 0.5 to 1.0) against the number of training samples (x-axes, 1000 to 3000), with one curve each for Bioscope, SFU and SFU/Bioscope.]

Figure 1: Average results for different numbers of training samples. SFU/BioScope is the model trained on the SFU Review corpus and applied on the BioScope corpus.


5   Results and discussion

For detecting speculation markers in the SFU Review corpus, and for detecting both speculation and negation markers in the BioScope corpus when trained on text of the same genre, the method was relatively successful (Figure 1), achieving results in line with the inter-annotator agreement. (As previous machine learning results have typically been achieved using larger training sets, the comparison was made to the agreement figures rather than to previous results.) For detecting negation, the increase in training data size did not affect these results, while the general trend for speculation was an improvement of results with more training samples, although results remained slightly unstable.

For detecting negation in the SFU Review corpus, on the other hand, results were much lower than the measured agreement figures. Results were consistently low for all four folds (F-scores of 0.70/0.75/0.76/0.74 for 3,000 training instances), and the F-score decreased with a larger training data set, due to a decrease in precision and a recall that remained low. It could be ruled out that the low results were due to the relatively small training data size, since an additional model, trained on 8,000 samples, gave even lower results (an F-score of 0.62). Multi-token negation markers including n’t or not were, however, very common among false negatives and false positives, and it is therefore likely that the low results for this category were due to the inability of the trained model to detect multi-token negations, i.e., the same problem that arose for two of the models trained by Cruz et al. (2015). This might, for instance, be an effect of not including the neighbouring words as features. The models were, however, in general able to detect multi-word marker words, e.g., the following complex speculation markers: I-’d-suggest, would-think, can-either, might-expect, would-feel. There were also a number of complex expressions among the false positives for speculation that might be considered as belonging to this class, despite not being annotated as such. Examples are can-hope, can-either, to-think.

Also the setting of training the model on the SFU Review corpus and evaluating it on the BioScope corpus gave low results, for negation as well as for speculation. It can, however, be observed that for speculation markers, this strategy was more successful than the previously explored strategy of training a model on biomedical article texts and applying it on the clinical text genre (Morante and Daelemans, 2009a). There might thus be a larger similarity between how speculation is expressed in consumer reviews and in clinical texts than between clinical and biomedical texts. Examining incorrectly classified segments showed that false negatives were not limited to marker words that might be more typical of the reasoning style of the clinical genre, e.g., evaluate, suggest, indicate, compatible, consistent and question, but also included general expressions such as possible and probable.

Results also show that not even lessons learnt about the choice of appropriate machine learning algorithms and features are transferable across genres, as the techniques for detecting negation that were shown to be successful for the BioScope corpus produced low results on the SFU Review corpus. Future work includes research on whether these findings also hold for the scope of the markers.

6   Conclusion

In the BioScope corpus, speculation and negation markers were detected with results close to previously reported annotator agreement scores. This was also the case for speculation markers in the SFU Review corpus, while detection of negation markers was unsuccessful in this genre. Training the model on consumer reviews and applying it on clinical text also yielded low results, showing that neither the trained models, nor the choice of appropriate algorithms and features, were transferable across the two text genres.
Acknowledgements

This work was funded by the StaViCTA project, framework grant “the Digitized Society – Past, Present, and Future” with No. 2012-5659 from the Swedish Research Council (Vetenskapsrådet).

References

Shashank Agarwal and Hong Yu. 2010. Detecting hedge cues and their scope in biomedical text with conditional random fields. Journal of Biomedical Informatics, 43(6):953–961.

Magnus Ahltorp, Hideyuki Tanushi, Shiho Kitajima, Maria Skeppstedt, Rafal Rzepka, and Kenji Araki. 2014. HokuMed in NTCIR-11 MedNLP-2: Automatic extraction of medical complaints from Japanese health records using machine learning and rule-based methods. In Proceedings of NTCIR-11, pages 158–162.

Eiji Aramaki, Mizuki Morita, Yoshinobu Kano, and Tomoko Ohkuma. 2014. Overview of the NTCIR-11 MedNLP-2 Task. In Proceedings of NTCIR-11, pages 147–154.

Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, Stroudsburg, PA, USA. Association for Computational Linguistics.

Behrouz Bokharaeian, Alberto Diaz, Mariana Neves, and Virginia Francisco. 2014. Exploring negation annotations in the DrugDDI corpus. In Fourth Workshop on Building and Evaluating Resources for Health and Biomedical Text Processing (BIOTxtM 2014).

Wendy W. Chapman, Will Bridewell, Paul Hanbury, Gregory F. Cooper, and Bruce G. Buchanan. 2001. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics, 34(5):301–310.

Noa P. Cruz, Maite Taboada, and Ruslan Mitkov. 2015. A machine-learning approach to negation and speculation detection for sentiment analysis. Journal of the Association for Information Science and Technology, pages 526–558.

Noa P. Cruz Díaz, Manuel J. Maña López, Jacinto Mata Vázquez, and Victoria Pachón Álvarez. 2012. A machine-learning approach to negation and speculation detection in clinical texts. Journal of the American Society for Information Science and Technology, 63(7):1398–1410.

Natalia Konstantinova and Sheila C. M. de Sousa. 2011. Annotating negation and speculation: the case of the review domain. In Proceedings of the Student Research Workshop associated with The 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011), 13 September, 2011, Hissar, Bulgaria, pages 139–144.

Natalia Konstantinova, Sheila C. M. de Sousa, Noa P. Cruz, Manuel J. Maña, Maite Taboada, and Ruslan Mitkov. 2012. A review corpus annotated for negation, speculation and their scope. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC), pages 3190–3195, Istanbul, Turkey. European Language Resources Association (ELRA).

Roser Morante and Eduardo Blanco. 2012. *SEM 2012 shared task: resolving the scope and focus of negation. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM 2012), pages 265–274.

Roser Morante and Walter Daelemans. 2009a. Learning the scope of hedge cues in biomedical texts. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing (BioNLP '09), pages 28–36, Stroudsburg, PA, USA. Association for Computational Linguistics.

Roser Morante and Walter Daelemans. 2009b. A metalearning approach to processing the scope of negation. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL '09), pages 21–29, Morristown, NJ, USA. Association for Computational Linguistics.

Andreas C. Müller and Sven Behnke. 2014. PyStruct: learning structured prediction in Python. Journal of Machine Learning Research, 15:2055–2060.

Charles Sutton and Andrew McCallum. 2006. An introduction to conditional random fields for relational learning. In Lise Getoor and Ben Taskar, editors, Introduction to Statistical Relational Learning. MIT Press.

Maite Taboada and Jack Grieve. 2004. Analyzing appraisal automatically. In Proceedings of the AAAI Spring Symposium on Exploring Attitude and Affect in Text: Theories and Applications, pages 158–161.

Maite Taboada, Caroline Anthony, and Kimberly Voll. 2006. Methods for creating semantic orientation dictionaries. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pages 427–432, Genoa, Italy. European Language Resources Association (ELRA).

Özlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556.
Sumithra Velupillai, Hercules Dalianis, and Maria
  Kvist. 2011. Factuality Levels of Diagnoses in
  Swedish Clinical Text. In A. Moen, S. K. Ander-
  sen, J. Aarts, and P. Hurlen, editors, Proc. XXIII In-
  ternational Conference of the European Federation
  for Medical Informatics (User Centred Networked
  Health Care), pages 559–563, Oslo, August. IOS
  Press.
Sumithra Velupillai. 2012. Shades of Certainty –
  Annotation and Classification of Swedish Medical
  Records. Doctoral thesis, Department of Computer
  and Systems Sciences, Stockholm University, Stock-
  holm, Sweden, April.
Veronika Vincze, György Szarvas, Richárd Farkas,
  György Móra, and János Csirik. 2008. The Bio-
  Scope Corpus: Biomedical texts annotated for un-
  certainty, negation and their scopes. BMC Bioinfor-
  matics, 9 (Suppl 11):S9.