=Paper=
{{Paper
|id=Vol-2429/paper8
|storemode=property
|title=Extracting Supporting Evidence from Medical Negligence Claim Texts
|pdfUrl=https://ceur-ws.org/Vol-2429/paper8.pdf
|volume=Vol-2429
|authors=Robert Bevan,Alessandro Torrisi,Danushka Bollegala,Frans Coenen,Katie Atkinson
|dblpUrl=https://dblp.org/rec/conf/ijcai/BevanTBCA19
}}
==Extracting Supporting Evidence from Medical Negligence Claim Texts==
Robert Bevan†, Alessandro Torrisi†, Danushka Bollegala†, Frans Coenen†, Katie Atkinson†
†University of Liverpool
{robert.bevan, alessandro.torrisi, danushka, coenen, k.m.atkinson}@liverpool.ac.uk

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Abstract

The number of medical negligence claims filed in the UK each year has increased significantly over the past decade [NHS, 2018]. When filing a medical negligence claim, electronic health records act as an important, legally valid source of evidence. Patients often undergo different and complex treatments over many months or years, easily resulting in hundreds of pages of electronically available medical records. It is therefore a non-trivial task to read all the related electronic health records and identify the supporting evidence needed to establish a legal case. Currently, the process of identifying evidence is carried out by humans who are experts in both medical negligence law and medicine. In this paper, we compare different methods of automatically extracting relevant statements from medical negligence claim texts, as a step towards a method for extracting relevant sections from electronic health records, with the aim of expediting the litigation process and reducing the manual effort involved. Specifically, we annotate a dataset containing medical negligence claim texts and train conditional random field (CRF) and long short-term memory (LSTM) network models for extracting information relevant to cases. Our evaluation shows that each model class has its merits in this task: the CRF models were significantly more effective at identifying full sequences, while the LSTMs were significantly better at assigning tags to tokens. We found both approaches were able to identify information that is key to the litigation process.

"University hospital mistakenly amputated my left leg despite the fact the cancer was confined within my right leg. I will now need to undergo another leg amputation and will be confined to a wheelchair for the rest of my life."

Figure 1: An example extraction (performed by a human).

1 Introduction

Medical negligence claims are a significant source of litigation. For example, in 2018, the National Health Service (NHS) in the United Kingdom reported that it paid GBP 1,623 million as compensation for 10,637 claims [NHS, 2018]. Acts of medical negligence can vary in complexity as well as severity. Finding the reasons behind medical negligence acts is important in order to prevent such unfortunate events in the future [Toyabe, 2012]. Moreover, in the event where a patient (or a legal representative acting on behalf of a patient) would like to prosecute the health care provider for medical negligence, a legal case must be filed based on medical evidence. An important source of medical evidence for such prevention efforts or litigation processes is the electronic health records describing the various treatments undergone by the patient, the medication prescribed for the patient, and their medical history. The volume of electronic health records for a single patient can be significant: it is not uncommon for a patient to be subjected to medical treatment for many months, if not years, and typically a much smaller set of relevant evidence supporting the medical negligence case must be identified from this vast amount of information. Furthermore, filtering electronic health records according to the date of the alleged negligent act is not sufficient when building a body of evidence, due to the non-contiguous distribution of evidence within the records. For example, negative patient outcomes may occur years after an initial negligent act, so filtering records by date may result in evidence being discarded.

The existing process for identifying supporting evidence from electronic health records is a manual one. Humans who are knowledgeable in both medical negligence law and medicine must read a collection of medical records and carefully select the parts that can be used as evidence in the litigation process. Needless to say, this is both a time consuming and a costly process. Moreover, the number of individuals possessing both legal and medical background knowledge is small, which means a limited number of medical records can be read and analysed over a given period of time. These drawbacks in the existing pipeline for extracting evidence call for automatic methods that can efficiently "read" large quantities of medical records and accurately extract the relevant evidence.
In this paper, given medical negligence claim texts, we compare methods of automatically extracting expressions that are relevant to the medical negligence case: the alleged negligent acts, and any consequential negative patient outcomes. This can help lawyers quickly establish the key elements of a case, and we conjecture it will also be useful as part of a system for automatically extracting supporting evidence from medical records.

Specifically, we first manually annotate a set of medical negligence claim texts, identifying any statements of negligent acts and any consequential negative patient outcomes. An example is shown in Figure 1, where text relating to negligent acts and negative outcomes is highlighted in red and blue respectively. Next, we train a Conditional Random Field (CRF) [Lafferty et al., 2001] model to predict BIO (Begin-Inside-Outside) tags, extracting sequences of tokens belonging to the previously described categories. We use different types of features, such as part of speech (POS), typography, and medical lexicons. One issue we encounter in this approach is data sparseness: the limited overlap between the tokens in the training and testing data. To overcome this issue, we use pre-trained word embeddings to automatically augment training instances with related features that did not appear in the original training instances. Our experimental results show that this feature augmentation approach successfully overcomes the data sparseness problem. Finally, we train various Long Short-Term Memory (LSTM) networks [Hochreiter and Schmidhuber, 1997] for the same task, experimenting with both regular and bidirectional LSTMs (BiLSTMs) and making use of both word and character level features.
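To make the BIO scheme concrete, the opening of the Figure 1 example could be tokenised and tagged as follows. This is our own illustrative annotation, not taken from the dataset; the result tables later in the paper use NA for negligent act and O for negative outcome, so here we write NO for the outcome class and reserve the bare O tag for tokens outside any annotated span.

    # Hypothetical BIO tagging of the start of the Figure 1 example.
    # B-NA/I-NA mark a negligent act span; B-NO/I-NO mark a negative
    # outcome span; O marks tokens outside any annotated span.
    tagged = [
        ("University", "O"), ("hospital", "O"),
        ("mistakenly", "B-NA"), ("amputated", "I-NA"), ("my", "I-NA"),
        ("left", "I-NA"), ("leg", "I-NA"),
        # ... later in the text ...
        ("confined", "B-NO"), ("to", "I-NO"), ("a", "I-NO"),
        ("wheelchair", "I-NO"),
    ]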
2 Related Work

Information extraction has a long and established history as a task in NLP. In Named Entity Recognition (NER) [Shen et al., 2018; Kuru et al., 2016; Ritter et al., 2011; Guo et al., 2009; Rud et al., 2011], the goal is to extract mentions of named entities such as people, locations, organisations, and products. It has been reported that over 70% of web search queries contain some form of named entity [Guo et al., 2009]; being able to recognise named entities therefore enables more relevant results in information retrieval. Relation Extraction (RE) [Mandya et al., 2017; Miwa and Bansal, 2016] further extends this process by identifying the semantic relations that exist between two or more recognised named entities. For example, a competitor relation can exist between two companies, which can later transform into an acquisition relation. In medical contexts, identifying the adverse drug reactions (ADRs) associated with medications, whether from formal reporting tools such as the Yellow Card Scheme or from more informal reporting methods such as social media, has received wide attention [Bollegala et al., 2018; Sloane et al., 2015].

Our problem, extracting litigation-relevant statements from medical negligence case texts, can be seen as a specific instance of the above-described information extraction problem. However, some important properties of our setting differentiate it from the more popular information extraction problems such as NER, RE, or ADR extraction. First, compared to named entities, evidence related to medical negligence tends to comprise longer sequences: the evidence extracted in Figure 1, for example, contains the sequence of words "mistakenly amputated my left leg". Second, unlike relations or entities, it is non-obvious how to classify negligence-related evidence into categories, which becomes problematic when generalising extraction rules from one domain to another. To the best of our knowledge, the problem of extracting medical negligence related evidence from free text has not been studied before.

3 Evidence Extraction

CRFs and LSTMs are two classes of models that perform well, and are often employed, in a range of sequence labelling tasks [Huang et al., 2015; Shi et al., 2015; McCallum and Li, 2003]. Both model classes are able to leverage historical and future sequence information when classifying the current sequence element, which makes them well suited to natural language processing tasks. One advantage LSTMs have over CRFs is their ability to learn feature representations specific to the task at hand. We employ both model classes in this work and compare their performance on the task of identifying negligent acts and consequential negative patient outcomes in medical negligence claim texts.

The dataset used in this evaluation comprises 2,014 medical negligence claim summary texts collected by a law firm operating in the medical negligence domain. These texts contain statements describing negligent acts as well as any consequential negative patient outcomes (Figure 1). The texts were annotated by a domain expert with BIO tags delineating negligent act statements and consequential negative patient outcome statements. Table 1 summarises the dataset. Due to the confidential nature of this dataset, we are unable to share it publicly.

Statement type      Count   Mean word count
negligent act       2551    11 (±6)
negative outcome    5510    4 (±3)

Table 1: Dataset summary.
4 Experiments

CRF models were trained using various combinations of the features listed in Table 2. The features listed in the left-hand column are common to most text tagging tasks. Those listed in the middle column were introduced to address the problem of data sparseness. The similar word features require further explanation: these were generated using pre-trained GloVe [Pennington et al., 2014] embeddings; given a word, the N words with the highest cosine similarity were included as additional features, with the value of N varied over N = {1..10}. Similar word suffix features were also experimented with.

Generic          Sparseness              Domain specific
word             stem                    sentiment
word suffixes    stem suffixes           in medical lexicon
is upper case    similar words           in first sentence
is title         similar word suffixes
is digit
POS tag
POS tag suffix
is first word
is last word

Table 2: Features used in CRF experiments.
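The paper does not name the tooling used to generate the similar word features; as a minimal sketch, assuming GloVe vectors converted to word2vec format and loaded with the gensim library:

    # Sketch: similar-word features from pre-trained GloVe embeddings.
    # Assumes a GloVe file converted to word2vec format; the library and
    # file name are our assumptions, not specified by the paper.
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("glove.6B.50d.w2v.txt")

    def similar_word_features(token, n=7):
        """Return the n nearest neighbours of a token by cosine
        similarity, for use as additional CRF features (N=7 performed
        best in the experiments reported in Table 4)."""
        if token.lower() not in vectors:
            return {}
        neighbours = vectors.most_similar(token.lower(), topn=n)
        return {f"similar_word_{i}": word
                for i, (word, _score) in enumerate(neighbours)}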
The features in the right-hand column are domain specific. For example, it was observed that negligent act statements are often present in the first sentence of a claim text, and that negligent act statements frequently contain medical terminology. The listed features were computed for each token in each sequence, as well as for the preceding and following tokens. All CRF models were trained using the sklearn-crfsuite Python package [Korobov, 2017]. The following hyper-parameters were tuned using a randomised search over 50 iterations: the elastic net regularisation coefficients, the minimum feature frequency, and the possible state and transition features.
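A minimal sketch of this training set-up with sklearn-crfsuite follows. The hyper-parameter names come from the package's API; the toy data and the search distributions are our own illustrative choices.

    # Sketch of the CRF training set-up. c1/c2 are the elastic net
    # regularisation coefficients; min_freq is the minimum feature
    # frequency; all_possible_states/all_possible_transitions control
    # the generated state and transition features.
    import scipy.stats
    import sklearn_crfsuite
    from sklearn.metrics import make_scorer
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn_crfsuite import metrics

    # Toy data: each sequence is a list of per-token feature dicts
    # (real features follow Table 2), labelled with BIO tags.
    X = [[{"word": "mistakenly"}, {"word": "amputated"}, {"word": "leg"}]] * 6
    y = [["B-NA", "I-NA", "I-NA"]] * 6

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    search = RandomizedSearchCV(
        crf,
        param_distributions={
            "c1": scipy.stats.expon(scale=0.5),
            "c2": scipy.stats.expon(scale=0.05),
            "min_freq": [0, 1, 2, 5],
            "all_possible_states": [True, False],
            "all_possible_transitions": [True, False],
        },
        n_iter=50,  # 50 iterations of randomised search, as in the paper
        cv=3,
        scoring=make_scorer(metrics.flat_f1_score, average="weighted"),
    )
    search.fit(X, y)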
We experimented with various LSTM configurations (see Table 3). The baseline LSTM comprised a 50-dimensional word embedding input, a single LSTM layer of 16 hidden units, and a softmax output. This model was trained both with random and with pre-trained GloVe word embedding initialisation. A bidirectional variant of the baseline LSTM was also experimented with. In addition, the baseline model was extended to include character-level features; this was achieved using a convolutional layer containing 8 hidden units, with a 16-dimensional character embedding input. All LSTM models were trained using the NCRF++ Python package [Yang and Zhang, 2018]. Each LSTM was trained for 100 epochs using stochastic gradient descent with a learning rate of 0.015, a learning rate decay of 0.05, and a batch size of 32. During training, models were evaluated at the end of each epoch using a validation set, and the best performing model across the 100 epochs was selected for use in the evaluation. Training was repeated 5 times for each LSTM configuration in order to reduce the influence of pathological local minima; none were observed, so we randomly selected one of the 5 models per configuration for the evaluation.

LSTM settings
LSTM
LSTM + GloVe
LSTM + Char
BiLSTM
BiLSTM + Char
BiLSTM + GloVe + Char

Table 3: LSTM configurations used in these experiments.
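NCRF++ builds these models from configuration files rather than hand-written model code. Purely as an illustration of the strongest configuration described above (a BiLSTM with character-level CNN features), here is a minimal PyTorch sketch, with vocabulary sizes and the tag set assumed for the example:

    # Illustrative BiLSTM tagger with character-CNN features, mirroring
    # the reported dimensions: 50-dim word embeddings, 16-dim character
    # embeddings, an 8-filter character CNN, and a 16-unit (Bi)LSTM.
    import torch
    import torch.nn as nn

    class BiLSTMCharTagger(nn.Module):
        def __init__(self, word_vocab=10000, char_vocab=100, n_tags=5):
            super().__init__()
            self.word_emb = nn.Embedding(word_vocab, 50)
            self.char_emb = nn.Embedding(char_vocab, 16)
            self.char_cnn = nn.Conv1d(16, 8, kernel_size=3, padding=1)
            self.lstm = nn.LSTM(50 + 8, 16, batch_first=True,
                                bidirectional=True)
            self.out = nn.Linear(2 * 16, n_tags)

        def forward(self, words, chars):
            # words: (batch, seq_len); chars: (batch, seq_len, word_len)
            b, s, w = chars.shape
            ce = self.char_emb(chars).view(b * s, w, 16).transpose(1, 2)
            # Max-pool the character CNN over each word's characters.
            cf = torch.max(self.char_cnn(ce), dim=2).values.view(b, s, 8)
            x = torch.cat([self.word_emb(words), cf], dim=2)
            h, _ = self.lstm(x)
            return self.out(h)  # per-token logits; softmax at train time

    # SGD as reported (lr 0.015, batch size 32); the per-epoch learning
    # rate decay of 0.05 would be applied in the training loop.
    model = BiLSTMCharTagger()
    optimiser = torch.optim.SGD(model.parameters(), lr=0.015)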
5 Results

The different methods were compared using a 5-fold cross-validation scheme. Performance metrics were computed both at the sequence level and at the token level. Token level metrics were computed using the negligent act and negative patient outcome labels only (i.e. "other" tags were ignored). Neither evaluation scheme is perfectly suited to identifying the best performing sequence tagger. Evaluating models at the sequence level only will discount any example where the system correctly identifies the vast majority of a sequence but misses a single, minimally important term. Token level evaluation, in turn, can mask pathological behaviour: a system can correctly identify the majority of a phrase but fail to identify a single important component (e.g. "no longer have any mobility in my") and still score highly under this scheme. While it is not perfect, we suggest the phrase level evaluation is likely to be a better indicator of a model's usefulness in practice.
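To make the two schemes concrete, here is a sketch of both evaluations on BIO-tagged output, assuming the seqeval and scikit-learn libraries (the paper does not name its evaluation tooling, and we write NO for the negative outcome class to keep the bare O tag for unannotated tokens):

    # Sequence (phrase) level: a span only counts if its label and both
    # boundaries match. Token level: each tag is scored independently,
    # micro-averaged over the evidence tags so "other" tokens are ignored.
    from seqeval.metrics import f1_score as span_f1
    from sklearn.metrics import precision_recall_fscore_support

    gold = [["B-NA", "I-NA", "I-NA", "O", "B-NO", "I-NO"]]
    pred = [["B-NA", "I-NA", "O",    "O", "B-NO", "I-NO"]]

    print(span_f1(gold, pred))  # the truncated NA span scores zero

    flat_gold = [tag for seq in gold for tag in seq]
    flat_pred = [tag for seq in pred for tag in seq]
    print(precision_recall_fscore_support(
        flat_gold, flat_pred, average="micro",
        labels=["B-NA", "I-NA", "B-NO", "I-NO"]))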
In order to test for the statistical significance of the results, we employed the corrected resampled t-test [Nadeau and Bengio, 2001], coupled with the Bonferroni correction for multiple comparisons [Dunn, 1961].

CRF feature set                   Prec    Rec     F1
Base                              0.486   0.385   0.428
Base + stem                      0.486   0.384   0.427
Base + stem + suffix             0.492   0.382   0.429
Base + sentiment                 0.487   0.378   0.424
Base + in medical lexicon        0.468   0.379   0.417*
Base + in first sentence         0.495   0.396   0.438
Base + 7 similar words           0.497   0.406   0.445*
Base + 6 similar words + suffix  0.489   0.406   0.443*

Table 4: Selected CRF model performance evaluated at the sequence level (micro-averaged). Note: Base refers to the baseline CRF, which made use of generic features only; best results in bold font; * indicates a significant result (P=0.05, Bonferroni corrected).

Configuration            Prec     Rec     F1
LSTM                     0.245    0.252   0.248
LSTM + GloVe             0.195*   0.215*  0.205*
LSTM + Char              0.260    0.286   0.272
BiLSTM                   0.230    0.242   0.236*
BiLSTM + Char            0.256    0.273*  0.264
BiLSTM + Char + GloVe    0.197*   0.219*  0.207*

Table 5: LSTM performance evaluated at the sequence level (micro-averaged). Note: best results in bold font; * indicates a significant result (P=0.05, Bonferroni corrected).
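The corrected resampled t-test replaces the usual 1/k variance factor with (1/k + n_test/n_train), compensating for the overlap between training sets across folds. A sketch of the computation (the helper function is ours; for 5-fold cross-validation each test fold is a quarter the size of its training set, so n_test/n_train = 0.25):

    import numpy as np
    from scipy import stats

    def corrected_resampled_ttest(scores_a, scores_b, test_frac=0.25):
        """Corrected resampled t-test [Nadeau and Bengio, 2001] on
        paired per-fold scores; test_frac = n_test / n_train."""
        d = np.asarray(scores_a) - np.asarray(scores_b)
        k = len(d)
        # Variance term corrected by (1/k + n_test/n_train).
        t = d.mean() / np.sqrt((1.0 / k + test_frac) * d.var(ddof=1))
        p = 2 * stats.t.sf(abs(t), df=k - 1)
        return t, p

    # Example: per-fold F1 scores for two models (illustrative numbers);
    # the resulting p-values would then be Bonferroni adjusted.
    print(corrected_resampled_ttest([0.44, 0.46, 0.43, 0.45, 0.44],
                                    [0.40, 0.41, 0.39, 0.42, 0.40]))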
"University hospital mistakenly amputated my left leg despite the fact the cancer was confined within my right leg. I will now need to undergo another leg amputation and will be confined to a wheelchair for the rest of my life."

"I believe the University pharmacy to be negligent as they misprescribed me with ibuprofen when they should have given me paracetamol. I felt sick for a week as a result."

"I believe the midwife at University hospital was at fault because she dropped my newborn Son. This caused his arm to break, and his head is now misshapen. We are unsure if his head will ever regain its original shape, or if he will have lasting problems with his arm."

"I believe the GP at the Village Health Centre should have noticed the lump when I first presented with my symptoms. My cancer diagnosis has now been delayed by 15 months, and the prognosis is much worse."

Figure 2: Example extractions performed by the CRF and LSTM. Tokens underlined with blue were identified by the CRF only; tokens underlined with red were identified by the LSTM only; and tokens underlined with violet were identified by both models. (The coloured underlining is not reproduced in this text version.)

Sequence level evaluation

        Prec           Rec            F1
        CRF    LSTM    CRF    LSTM    CRF    LSTM
NA      0.50*  0.32    0.43   0.41    0.46*  0.36
O       0.49*  0.22    0.39*  0.23    0.44*  0.23
AVG     0.49*  0.26    0.40*  0.29    0.44*  0.27

Token level evaluation

        Prec           Rec            F1
        CRF    LSTM    CRF    LSTM    CRF    LSTM
B-NA    0.68   0.87*   0.59   0.60    0.63   0.71*
I-NA    0.63   0.81*   0.67   0.77*   0.65   0.79*
B-O     0.60   0.75*   0.48*  0.24    0.54*  0.37
I-O     0.62   0.72*   0.55*  0.50    0.59   0.59
AVG     0.63   0.78*   0.60   0.61    0.61   0.67*

Table 6: Comparison of the best performing CRF and LSTM models evaluated at the phrase and token levels. Note: "NA" refers to negligent act; "O" refers to consequential negative outcome; AVG refers to the micro average; best results in bold font; * indicates a significant result (P=0.05, Bonferroni corrected).
Table 6 compares the best performing CRF and LSTM models. The CRF model performed significantly better at the sequence level, while the LSTM offered significantly better token level performance. Inspecting extractions performed on a test set can be useful in comparing models, and Figure 2 shows some example extractions performed using these two models. The outputs of the different models vary considerably: the two approaches fully agree on only a single instance of the 12 in total. The LSTM repeatedly fails to identify the beginning of sequences: it outputs only a single B tag (a B tag indicates the first term in a sequence) out of a possible 12, whereas the CRF outputs 9 B tags. The LSTM exhibits additional undesirable behaviour: it erroneously splits sequences in two, often dropping a common word. It appears that the LSTM gives too much consideration to the current word and discounts the previous sequence information. Both approaches make some subtle mistakes that produce extractions that appear correct at first glance but are actually incorrect. For example, in the third example in Figure 2, the CRF identifies the sequence "lasting problems with his arm", when in reality the author of the statement suggests they are unsure whether the child will have lasting problems with their arm. Extractions like this could prove problematic if such a system is used to quickly extract the key case facts from a statement.

Tables 4 and 5 compare the different CRF feature sets and LSTM configurations. The different LSTM configurations performed similarly well, except in cases where the word embeddings were initialised using pre-trained GloVe vectors: in these instances the models performed significantly worse than the baseline LSTM. We also found that training a BiLSTM with character level features significantly improved recall. Moreover, we found that adding sparseness-counteracting features improved CRF performance; the best performing CRF model made use of similar word features (N=7). We also found adding domain specific features to be helpful: including whether or not a word occurs in the claim text's first sentence as a feature significantly improved token level performance. This feature was strongly associated with the negligent act class.

6 Discussion and Conclusion

In this set of experiments we found that both CRF and LSTM models were able to extract litigation-relevant information from medical negligence claim texts. We observed that the CRF was better able to identify entire useful phrases, while the LSTM was able to assign labels to tokens with higher precision. The best performing CRF model's ability to identify evidence is likely sufficient for it to be useful in practice. We found that enriching the CRF features with similar words, computed using pre-trained word embeddings, improved the CRF's performance. We also observed that including domain specific features improved the CRF's performance. While the evaluation suggests the CRF is better suited to this task than the LSTM, we recognise it may well be biased in favour of the CRF, because we experimented with few LSTM architectures, and the architecture is an important hyperparameter when training neural network models. In future work we plan to experiment further with the LSTM architecture. Specifically, we plan to vary the dimensionality of the various embedding and hidden layers. We also plan to experiment with a CRF output layer, with the view that this will likely improve the LSTM's sequence level performance. Finally, we plan to collect more data, which may benefit both approaches and further assist with the development of our automated tools for processing medical negligence documents.
References

[Bollegala et al., 2018] Danushka Bollegala, Simon Maskell, Richard Sloane, Joanna Hajne, and Munir Pirmohamed. Causality patterns for detecting adverse drug reactions from social media: Text mining approach. JMIR Public Health and Surveillance, 4(2):e51, May 2018.

[Dunn, 1961] Olive Jean Dunn. Multiple comparisons among means. Journal of the American Statistical Association, pages 52–64, 1961.

[Guo et al., 2009] Jiafeng Guo, Gu Xu, Xueqi Cheng, and Hang Li. Named entity recognition in query. In SIGIR 2009, pages 267–274, 2009.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, November 1997.

[Huang et al., 2015] Zhiheng Huang, Wei Xu, and Kai Yu. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991, August 2015.

[Korobov, 2017] Mikhail Korobov. sklearn-crfsuite. https://github.com/TeamHG-Memex/sklearn-crfsuite, 2017.

[Kuru et al., 2016] Onur Kuru, Ozan Arkan Can, and Deniz Yuret. CharNER: Character-level named entity recognition. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 911–921, Osaka, Japan, December 2016.

[Lafferty et al., 2001] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML 2001, pages 282–289, 2001.

[Mandya et al., 2017] Angrosh Mandya, Danushka Bollegala, Frans Coenen, and Katie Atkinson. Frame-based semantic patterns for relation extraction. In Proceedings of the 15th International Conference of the Pacific Association for Computational Linguistics (PACLING), pages 51–62, 2017.

[McCallum and Li, 2003] Andrew McCallum and Wei Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 (CoNLL '03), pages 188–191, 2003.

[Miwa and Bansal, 2016] Makoto Miwa and Mohit Bansal. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1105–1116, Berlin, Germany, August 2016.

[Nadeau and Bengio, 2001] Claude Nadeau and Yoshua Bengio. Inference for the generalization error. Machine Learning, 2001.

[NHS, 2018] NHS. Annual report and accounts 2017/18. Technical report, National Health Service (NHS) Resolution, 2018.

[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543, 2014.

[Ritter et al., 2011] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. Named entity recognition in tweets: An experimental study. In EMNLP 2011, pages 1524–1534, 2011.

[Rud et al., 2011] Stefan Rüd, Massimiliano Ciaramita, Jens Müller, and Hinrich Schütze. Piggyback: Using search engines for robust cross-domain named entity recognition. In ACL 2011, 2011.

[Shen et al., 2018] Yanyao Shen, Hyokun Yun, Zachary C. Lipton, Yakov Kronrod, and Animashree Anandkumar. Deep active learning for named entity recognition. In International Conference on Learning Representations, 2018.

[Shi et al., 2015] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems 28, pages 802–810. Curran Associates, Inc., 2015.

[Sloane et al., 2015] Richard Sloane, Orod Osanlou, David Lewis, Danushka Bollegala, Simon Maskell, and Munir Pirmohamed. Social media and pharmacovigilance: A review of the opportunities and challenges. British Journal of Clinical Pharmacology, pages 910–920, 2015.

[Toyabe, 2012] Shin-ichi Toyabe. Detecting inpatient falls by using natural language processing of electronic medical records. BMC Health Services Research, 12(1), December 2012.

[Yang and Zhang, 2018] Jie Yang and Yue Zhang. NCRF++: An open-source neural sequence labeling toolkit. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018.