Narrative detection in online patient communities

Anne Dirkson
a.r.dirkson@liacs.leidenuniv.nl
Leiden Institute of Advanced Computer Science, Leiden University
Niels Bohrweg, 2333 CA Leiden, the Netherlands

Suzan Verberne
s.verberne@liacs.leidenuniv.nl
Leiden Institute of Advanced Computer Science, Leiden University
Niels Bohrweg, 2333 CA Leiden, the Netherlands

Wessel Kraaij
w.kraaij@liacs.leidenuniv.nl
Leiden Institute of Advanced Computer Science, Leiden University
Niels Bohrweg, 2333 CA Leiden, the Netherlands


Although narratives on patient forums are a valuable source of medical information, their systematic detection and analysis have so far been limited to a single study. In this study, we examine whether psycho-linguistic features or document embeddings can aid identification of narratives. We also investigate which features distinguish narratives from other social media posts. This study is the first to automatically identify the topics discussed in narratives on a patient forum. Our results show that for classifying narratives, character 3-grams outperform psycho-linguistic features and document embeddings. We found that narratives are characterized by the use of past tense, health-related words and first-person pronouns, whereas non-narrative text is associated with the future tense, emotional support words and second-person pronouns. Topic analysis of the patient narratives uncovered fourteen different medical topics, ranging from tumor surgery to side effects. Future work will use these methods to extract experiential patient knowledge from social media.

Introduction

Nowadays, online patient forums are the main medium by which patients exchange their narratives. These narratives mainly recount their own experiences with their condition. As such, they contain experiential knowledge [Bor76], defined as the knowledge that patients gain from their own experiences. In recent years, such experiential knowledge has increasingly been recognized as valuable and complementary to empirical knowledge [CBC+13]. Consequently, more health-related applications are making use of patient forum data, for instance to track public health trends [SOG+16] and to detect adverse drug responses [SGN+15]. Experiential knowledge is also valuable for patients themselves: patients indicate that they strongly rely on experiences and information provided on patient forums [SHBL16]. This is especially true for patients with a rare disease, for which medical professionals often lack expertise and the number of studies is limited [AKG08].

To understand the experiential knowledge on patient forums, forum posts that contain narratives must first be identified. As yet, research into systematically distinguishing patient narratives on patient forums is limited to a single study on Dutch forum data [VBSEng], which uses only words as features. We expand upon this work using a different data set by examining whether document embeddings and psycho-linguistic features can improve the identification of patient narratives. We expect so, because these aggregated features are less dependent on individual terms, which may overlap significantly between narratives and factual statements about the same topic. Secondly, we explore how narratives differ from other types of posts by studying which features are influential in identifying narratives and which posts are classified incorrectly. Thirdly, we analyze how prevalent narratives are on a cancer patient forum and which topics these narratives discuss.

Related Work

Narratives on patient forums have mainly been studied qualitatively (e.g. [vUKDT+09]). The automatic identification of narratives on a patient forum is limited to the study by Verberne et al. [VBSEng] on a Dutch cancer forum. They identified narratives with an F1 of 0.911 using only the lower-cased words of the posts as features. They also found that various linguistic factors (1st person singular, 3rd person and negations) and psychological processes (social processes and religion) were correlated with the presence of narratives. These psycho-linguistic features were measured using the Linguistic Inquiry and Word Count (LIWC) method [TP10].

Additionally, research into self-reported adverse drug responses (ADRs) has led to the development of classifiers for differentiating between factual statements of ADRs and personal experiences of ADRs on social media [BY12, NSO+15, SG15]. However, these classifiers are highly specific and thus not suitable for identifying patient narratives in general.

Another closely related field is the classification of personal health mentions on social media, i.e. posts that mention a person who is affected as well as their specific condition, such as: 'my granddad has Alzheimer's'. Presently, only two studies have investigated this task. The first, by Lamb et al. [LPD13], focused on separating flu awareness from actual flu reports on social media. More recently, Karisani and Agichtein [KA18] introduced WESPAD, a classifier for personal health mentions, which attains state-of-the-art performance for seven different health domains including stroke, depression and flu infection. Nonetheless, a personal health mention alone is not sufficient to consider the post a narrative, and thus these classifiers are also inadequate for our purpose.

Methods

Data

Our data come from an open, international Facebook forum for patients with Gastrointestinal Stromal Tumor (GIST).1 The forum is moderated by GIST Support International and contains 36,722 posts with a median length of 20 tokens.

Preprocessing

The data was lowercased and tokenized with NLTK. Due to the noisy nature of user-generated content, especially in the spelling of medical terms, we applied a tailored preprocessing pipeline2 to our data. Firstly, an existing normalization pipeline for social media [Sar17] 3 was used to normalize tokens to American English and to expand generic abbreviations used on social media. Next, domain-specific abbreviations were expanded with a lexicon of 42 non-ambiguous abbreviations, generated based on 1000 posts and annotated by a domain expert and the first author. Spelling mistakes were detected using a combination of relative frequency and edit distance to possible candidates, and corrected using weighted Levenshtein distance. Correction candidates were derived from the corpus itself. Drug names were normalized using the RxNorm database [Nat]. Non-English posts were removed using langid [LB12]. Punctuation was removed, but stop words were not, as we expect function words to play a role in the expression of narratives.
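The spelling-correction step can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses plain (unweighted) Levenshtein distance rather than the weighted variant, and the thresholds `max_dist` and `min_ratio`, the function names and the toy corpus are all hypothetical choices made for illustration.

```python
from collections import Counter


def edit_distance(a, b):
    """Plain (unweighted) Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def correct_token(token, counts, max_dist=2, min_ratio=10):
    """Return a much more frequent corpus token close to `token`, else `token`.

    max_dist and min_ratio are assumed thresholds, not values from the paper.
    """
    best, best_key = token, None
    for cand, freq in counts.items():
        # A candidate must differ from the token and be far more frequent.
        if cand == token or freq < min_ratio * counts[token]:
            continue
        dist = edit_distance(token, cand)
        if dist > max_dist:
            continue
        key = (dist, -freq)  # prefer closer, then more frequent candidates
        if best_key is None or key < best_key:
            best, best_key = cand, key
    return best


# Toy corpus: candidates are derived from the corpus itself.
corpus_tokens = ["imatinib"] * 10 + ["imatinb"] + ["tumor"] * 2
counts = Counter(corpus_tokens)
print(correct_token("imatinb", counts))  # imatinib
print(correct_token("tumor", counts))    # tumor
```

Requiring a candidate to be far more frequent than the token is one simple way to combine relative frequency with edit distance; rare but correct terms stay untouched as long as no much more frequent neighbour exists.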

Supervised classification

Manual annotation of example data

We randomly selected 1050 posts for annotation. The annotators were asked to indicate per message whether it contained a personal experience. They were not provided with its context. Personal experiences did not need to be about the author but could be about someone else. This definition was based on earlier work by Verberne et al. [VBSEng] and van Uden-Kraan et al. [vUKDT+08]. The first 50 posts were annotated individually by the first author and another PhD student to improve the annotation guidelines.4 The remaining 1000 posts were divided equally into six sets of 200 posts, with 40 posts (20%) overlapping between all sets. The overlap was used to calculate the pairwise Cohen's kappa. There were seven annotators in total: six PhD students and one GIST patient. Each sample was assigned to an annotator, apart from one sample which was divided between two PhD students. To be able to include the overlapping sample in the classification, we opted to use the annotations of the GIST patient for these 40 posts.5
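As an illustration of the agreement measure, the sketch below computes Cohen's kappa for every pair of annotators on a shared set of posts. The annotator names and labels are invented, and averaging the pairwise values into a single figure is an assumption; the text does not state how the pairwise values were aggregated.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / n ** 2
    if expected == 1.0:  # both annotators used a single, identical label
        return 1.0
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(annotations):
    """Average kappa over all annotator pairs (aggregation is an assumption)."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(annotations[a], annotations[b]) for a, b in pairs) / len(pairs)


# Hypothetical labels (1 = narrative, 0 = no narrative) on eight shared posts.
annotations = {
    "annotator_1": [1, 1, 0, 0, 1, 0, 0, 0],
    "annotator_2": [1, 0, 0, 0, 1, 0, 0, 1],
    "annotator_3": [1, 1, 0, 0, 1, 0, 1, 0],
}
print(round(mean_pairwise_kappa(annotations), 3))
```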

Feature sets

Four feature sets were derived from the text data: word unigrams, character n-grams (using the CountVectorizer function in sklearn), psycho-linguistic features, and document embeddings. For both word unigrams and character n-grams, we investigated whether TF-IDF weighting would improve performance compared to raw counts. Additionally, we explored whether stemming or lemmatising the data prior to extracting the unigrams could improve performance. Psycho-linguistic features were based on LIWC 2015 [TP10]. Punctuation categories were discarded, resulting in 82 LIWC features in total. LIWC is a well-known method for investigating psychological processes in text and includes both linguistic (e.g. first-person pronouns) and psychological categories (e.g. positive emotions). The last feature set consisted of document embeddings: a doc2vec model [LM14] was trained on the labeled training data for each fold in the cross-validation. We combined a distributed memory (DM) model with a distributed bag of words (DBOW) model, as recommended by Le and Mikolov [LM14]. We also attempted to train document embeddings first on the unsupervised data and then re-train on the supervised data, but this led to nonsensical classification features.
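A minimal sketch of the character 3-gram feature set, assuming scikit-learn is installed. The four toy posts and their labels are invented, and `analyzer="char"` with `C=1.0` are assumed settings, not the tuned configuration reported in this paper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical lower-cased posts; 1 = narrative, 0 = no narrative.
posts = [
    "i had surgery last year and my tumor was removed",
    "i was on imatinib for two years",
    "good luck with your scan you will be fine",
    "we hope you will share your news",
]
labels = [1, 1, 0, 0]

# Raw counts of character 3-grams fed into a linear SVM.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(3, 3)),
    LinearSVC(C=1.0),
)
model.fit(posts, labels)

vocab = model.named_steps["countvectorizer"].vocabulary_
print(len(vocab), "distinct character 3-grams")
print(model.predict(["my tumor was removed in 2010"]))
```

Switching `ngram_range` to `(3, 4)` or `analyzer` to `"char_wb"` would give the n-gram range and word-boundary variants mentioned in the results.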

Supervised classification algorithms

Classifiers were evaluated separately for each feature set. We ignored all posts that had been left empty by the annotator (the annotator chose neither yes nor no): three posts were ignored for this reason. For word unigrams, character n-grams and psycho-linguistic features, we compared four sklearn classification algorithms: Multinomial Naive Bayes (MNB), linear Support Vector Classification (LinearSVC), Stochastic Gradient Descent (SGD) with log loss, and K Nearest Neighbours (KNN). These were chosen according to the following criteria: (1) known to perform well on text data, (2) recommended for small data sets and (3) able to calculate probabilistic outcomes. The last criterion enabled us to use probabilistic ensembles. The doc2vec representations were classified with Logistic Regression: the training documents were tagged with their labels when training the doc2vec model, and this model was then used to derive vector representations for new documents. To test whether a combination of feature types could improve performance, we evaluated soft voting (argmax of the sums of the predicted probabilities) of the best individual classifiers for the best performing variants of each feature set. Significance testing was done with pair-wise t-tests.
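Soft voting as defined above (argmax of the summed predicted probabilities) can be sketched in a few lines. The probabilities below are invented, and the variable names are hypothetical; each inner pair is [P(no narrative), P(narrative)] for one post.

```python
def soft_vote(probabilities):
    """Return, per post, the class with the highest summed probability.

    `probabilities` holds one list per classifier; each list holds one
    per-class probability list per post.
    """
    n_posts = len(probabilities[0])
    n_classes = len(probabilities[0][0])
    predictions = []
    for post in range(n_posts):
        summed = [sum(clf[post][label] for clf in probabilities)
                  for label in range(n_classes)]
        predictions.append(max(range(n_classes), key=lambda label: summed[label]))
    return predictions


# Hypothetical outputs of three classifiers for two posts.
char_3gram_probs = [[0.4, 0.6], [0.7, 0.3]]
liwc_probs = [[0.3, 0.7], [0.45, 0.55]]
doc2vec_probs = [[0.7, 0.3], [0.6, 0.4]]

print(soft_vote([char_3gram_probs, liwc_probs, doc2vec_probs]))  # [1, 0]
```

For the second post, one classifier favours the narrative class, but the summed probabilities (1.75 against 1.25) still favour the non-narrative class, which is how a confident majority can outweigh a single dissenting model.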

To evaluate the performance, the average F1 score of a 10-fold cross validation was used. For each run, hyper-parameters were tuned for that specific training set using a 10-fold grid search on the training data. The tuning grids were based on sklearn documentation: C from 10^-3 to 10^3 (steps of ×10) for LinearSVC and Logistic Regression; number of neighbors from 3 to 11 (steps of 2) for KNN; and max iterations from 2 to 2048 (steps of ×2) and alpha from 10^-8 to 10^-2 (steps of ×10) for SGD. The dimensionality of the document vectors was tuned with a grid of 100 to 400 (steps of 100).
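The tuning grids described above can be written out explicitly, as in the sketch below. The dictionary layout is an illustrative choice; `C`, `n_neighbors`, `max_iter` and `alpha` are the usual scikit-learn parameter names, while `vector_size` is assumed here as the name of the doc2vec dimensionality parameter.

```python
from itertools import product

grids = {
    "LinearSVC / Logistic Regression": {
        "C": [10.0 ** e for e in range(-3, 4)],        # 10^-3 ... 10^3
    },
    "KNN": {
        "n_neighbors": list(range(3, 12, 2)),          # 3, 5, 7, 9, 11
    },
    "SGD": {
        "max_iter": [2 ** e for e in range(1, 12)],    # 2, 4, ..., 2048
        "alpha": [10.0 ** e for e in range(-8, -1)],   # 10^-8 ... 10^-2
    },
    "doc2vec": {
        "vector_size": list(range(100, 401, 100)),     # 100, 200, 300, 400
    },
}

# Number of SGD settings evaluated by an exhaustive grid search.
n_sgd_settings = len(list(product(*grids["SGD"].values())))
print(n_sgd_settings)  # 77
```

Because the grid search is itself 10-fold and nested inside the 10-fold evaluation, each of the 77 SGD settings is fitted 100 times in total, which is worth keeping in mind when extending the grids.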

Topic modelling of the whole data set

To label the remaining data, the best performing classifier was used with the hyper-parameter settings that were optimal in the majority of the training sets. To investigate which topics are discussed in the patient narratives, we used topic modelling with Non-negative Matrix Factorization (NMF) of the TF-IDF weighted tokens without stopwords. Topic coherence, measured using TC-W2V [OGCC15], was used to select the number of topics. Topic labels were assigned manually by exploring the words with the highest weights and the top-ranked (i.e. most relevant) messages per topic.
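To make the model selection step concrete, the sketch below scores candidate topic models with a word-embedding coherence in the spirit of TC-W2V: the coherence of a topic is taken as the mean pairwise cosine similarity between the vectors of its top words, and the coherence of a model as the mean over its topics. The two-dimensional vectors and the candidate topic lists are invented stand-ins for word2vec embeddings and NMF output.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def topic_coherence(top_words, vectors):
    """Mean pairwise similarity of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def model_coherence(topics, vectors):
    """Mean topic coherence over all topics of one model."""
    return sum(topic_coherence(t, vectors) for t in topics) / len(topics)


# Toy embeddings (assumption): scan terms and support terms point apart.
vectors = {"scan": (1.0, 0.0), "ct": (0.9, 0.1),
           "luck": (0.0, 1.0), "hope": (0.1, 0.9)}

# Hypothetical top words per topic for models with 1 and 2 topics.
candidates = {
    1: [["scan", "ct", "luck", "hope"]],
    2: [["scan", "ct"], ["luck", "hope"]],
}
best_k = max(candidates, key=lambda k: model_coherence(candidates[k], vectors))
print(best_k)  # 2
```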

Results

Annotated data

The data was slightly imbalanced, with 37.7% of the posts containing a narrative, resulting in a majority baseline of roughly 0.62. The inter-annotator agreement was substantial (κ = 0.69).

Classifier evaluation

A LinearSVC on character 3-grams achieves the highest F1 score (Table 1), although character 4-grams (p = 0.526), stemmed unigrams (p = 0.930) and lemmatised unigrams (p = 0.587) do not perform significantly worse. Character 5- and 6-grams also do not perform worse overall (p = 0.122 and p = 0.169), but their recall is significantly lower (p = 0.023 and p = 0.029). The classifiers for the best performing document embeddings (DM+DBOW) and psycho-linguistic features, however, are significantly worse overall than character 3-grams (p = 0.0055 and p = 0.026 respectively). Employing TF-IDF weighting does not aid any of the unigram or character n-gram features. Additionally, neither feature selection (F1 = 0.761) nor word boundaries (F1 = 0.796) improve the performance of character 3-grams. Using a range of character n-grams, namely 3-to-4 (F1 = 0.814), 3-to-5 (F1 = 0.814), or 3-to-6 (F1 = 0.812), also does not boost performance.

Ensemble classification did not perform better than character 3-grams alone (see Table 2). Nevertheless, an ensemble of all four feature types is significantly more precise than all other classifiers (p = 0.0048 compared to the second best). To further explore why ensemble classification does not manage to improve overall performance, we investigated the predictions of individual classifiers. As can be seen in Table 3, there is a high degree of overlap between the predictions based on character 3-grams and the other feature sets (88.3%, 83.8% and 84.4% respectively). Consequently, the vast majority of the predictions cannot be improved by complementing character 3-grams with these feature sets. Interestingly, 4.7% of the posts are misclassified by all feature sets. Considering the non-overlapping predictions, the percentage of correct predictions was higher for character 3-grams than for either document embeddings or psycho-linguistic features in a pair-wise comparison. Thus, it appears that adding these features would be more detrimental than beneficial to narrative classification.
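The comparison of predictions between two classifiers can be sketched as follows. The gold labels and predictions are invented, and the function name and returned keys are hypothetical; the sketch only illustrates the quantities discussed above (overlap, who is right when the classifiers disagree, and posts misclassified by both).

```python
def compare_predictions(gold, pred_a, pred_b):
    """Summarise how two binary classifiers agree and who is right when they differ."""
    n = len(gold)
    triples = list(zip(gold, pred_a, pred_b))
    same = [(g, a) for g, a, b in triples if a == b]
    different = [(g, a, b) for g, a, b in triples if a != b]
    return {
        "overlap": len(same) / n,                            # share of identical predictions
        "a_correct_when_different": sum(g == a for g, a, _ in different),
        "b_correct_when_different": sum(g == b for g, _, b in different),
        "both_wrong": sum(g != a for g, a in same) / n,      # misclassified by both
    }


# Hypothetical gold labels and predictions for ten posts.
gold = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
char_3gram_pred = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
liwc_pred = [1, 0, 0, 0, 0, 1, 1, 0, 1, 1]

summary = compare_predictions(gold, char_3gram_pred, liwc_pred)
print(summary)
```

With binary labels, exactly one of the two classifiers is correct whenever they disagree, so only the disagreements leave any room for an ensemble to improve on the stronger classifier.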

Influential features

Narratives are typically distinguished by terms relating to the past tense (was, had, years), health (imatinib, tumor, surgeri) and first-person narrative (my, i) (see Figure 1). This is corroborated by the character 3-grams, psycho-linguistic features and document embeddings. Some of the important terms for non-narrative texts are also health-related (patients, gist) and first-person narrative (we, us), which showcases the difficulty of the task at hand. In general, non-narrative texts seem to focus more on emotional support (prayer, share, may), second-person narrative (you, your) and the future (may, will). The psycho-linguistic features additionally reveal that narratives contain more mentions of causality and negative emotions. In contrast, non-narrative texts seem to contain more positive emotions. Lastly, as predicted, function words appear important for classifying narratives in social media, and it is thus advisable not to remove stopwords.

Error analysis for the best performing classifier

Error analysis reveals that a significant proportion of the errors is due to incorrect annotation: 36.9% of the false positives and 36.2% of the false negatives were labelled incorrectly (see Table 4). Specifically, annotators have difficulty correctly labelling discussions about personal medical facts or side effects as narratives (e.g. 'i have been on imatinib 5 months and lost 1/3 of my hair'). Conversely, annotators may incorrectly judge posts that give emotional support, external information or advice to be narratives while they are not (e.g. 'i may be wrong but total gastrectomy sounds very extreme for two small gist').

The incorrect labelling may have impacted the automated classification such that these categories are also more difficult for the computer to distinguish. The classifier does, however, appear to outperform human judgement and to some extent 'correct' the annotators' mistakes. In fact, its performance may be underestimated by the metrics based on these incorrect labels. Other types of posts that appear challenging for the computer are posts that lack context or contain questions. The former are often answers to unknown questions posed earlier in the thread.

Frequency and content of patient narratives

Automated narrative detection in unsupervised data

The percentage of narratives in the unlabelled data is 37.0%, which is comparable to the annotated sample. This results in a total of 13,436 posts for topic modelling.6

Topic modelling

The TC-W2V metric [OGCC15] identifies the optimal number of topics to be fourteen. The resulting topics relate to different aspects of the medical process for GIST patients (see Table 5). Note that imatinib is the most commonly used medication.

Discussion

The detection of narratives was most effective when using character 3-grams. Their strength is in their ability to cluster relevant word types based on suffixes and prefixes. This is especially relevant in the medical domain, e.g. all cancer medication for GIST ends in 'nib'. In contrast, psycho-linguistic features appear to suffer from oversimplification, because they aggregate words that define different classes into one category, e.g. we and my into the umbrella category of first-person pronouns (see Figure 1). The use of document embeddings may have been hampered by the small size of the data. An alternative explanation could be that incorrect labelling impacts these features more strongly than word-based features.
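The suffix effect can be checked directly on the drug names that appear in Table 5. The sketch below extracts the character 3-grams of each name and intersects them; the helper name `char_ngrams` is hypothetical.

```python
def char_ngrams(word, n=3):
    """Set of all character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


drugs = ["imatinib", "sunitinib", "regorafenib", "sorafenib", "nilotinib"]

# The only 3-gram shared by all five drug names is the common suffix.
shared = set.intersection(*(char_ngrams(d) for d in drugs))
print(shared)  # {'nib'}
```

A word unigram model treats these five names as unrelated features, whereas a character 3-gram model gives them one feature in common, so evidence about one medication transfers to the others.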

Narratives could be differentiated most strongly by their use of past tense, first-person narrative and health-related words. The first two are in line with the linguistic definition of a narrative. The stronger focus on health, however, may indicate that patients prefer to share their own health experiences rather than health information from external sources.

Annotating narratives appears to be a challenging task, despite providing annotators with a guideline based on previous work [VBSEng] and validated through initial annotation by two annotators. This is underscored by our inter-annotator agreement (κ = 0.69), which was comparable to that of Verberne et al. [VBSEng] (κ = 0.71). Our classifier performed less well than their system (F1 = 0.91), which may be explained by their larger sample of annotated data (2,051 posts).

Inevitably, our results depend on the choice of what constitutes a narrative and how annotators interpret this definition. It appears that especially the line between a medical fact about oneself and a medical experience is fuzzy for annotators. Future studies could perhaps use this knowledge to develop clearer guidelines.

Conclusion

For the detection of patient narratives on social media, psycho-linguistic features and document embeddings are outperformed by character 3-grams. These narratives are associated with the past tense, health and first-person pronouns, whereas non-narrative text is associated with the future tense, emotional support and second-person pronouns. The patient narratives could be subdivided into discussions of fourteen different medical topics, ranging from surgery to side effects. Future work will develop automated methods for the extraction of patient knowledge from the narratives.

Figure 1: The 20 Most Influential Features In Individual Classifiers. In (b) underscores represent spaces.

Table 1: Mean Test Score (10-fold CV) For Best Classifiers Per Feature Set

| Feature set       | Variant    | Size   | Classifier | F1 (SD)       | Recall (SD)   | Precision (SD) |
| Unigrams          | Original   | 4,078  | SGD        | 0.795 (0.025) | 0.788 (0.074) | 0.811 (0.055)  |
| Unigrams          | Stemmed    | 3,205  | SGD        | 0.814 (0.031) | 0.793 (0.047) | 0.840 (0.049)  |
| Unigrams          | Lemmatised | 3,777  | SGD        | 0.808 (0.039) | 0.810 (0.059) | 0.813 (0.070)  |
| Character n-grams | 3-grams    | 5,086  | SVC        | 0.815 (0.035) | 0.844 (0.047) | 0.793 (0.058)  |
| Character n-grams | 4-grams    | 16,496 | SVC        | 0.811 (0.027) | 0.827 (0.068) | 0.844 (0.029)  |
| Character n-grams | 5-grams    | 36,349 | SGD/SVC    | 0.796 (0.023) | 0.784 (0.059) | 0.817 (0.069)  |
| Character n-grams | 6-grams    | 60,443 | SGD        | 0.793 (0.040) | 0.797 (0.042) | 0.795 (0.079)  |
| LIWC              |            | 82     | SVC        | 0.773 (0.031) | 0.805 (0.044) | 0.752 (0.077)  |
| Doc2vec           | DBOW       | 400    | LogReg     | 0.737 (0.029) | 0.751 (0.056) | 0.735 (0.066)  |
| Doc2vec           | DM         | 400    | LogReg     | 0.762 (0.039) | 0.749 (0.062) | 0.785 (0.070)  |
| Doc2vec           | DM+DBOW    | 800    | LogReg     | 0.772 (0.037) | 0.803 (0.064) | 0.749 (0.055)  |

Table 2: Mean Test Score (10-fold CV) For Ensemble Classification. * DM+DBOW variant.

Table 3: Comparison of Predictions of Classifiers for Different Feature Sets. * DM+DBOW variant.

Table 4: Error Analysis for best classifier (Character 3-gram Classification of Narratives)

| False positives: reasons for misclassification | Frequency | False negatives: reasons for misclassification | Frequency |
| Mislabelling                                   | 24        | Mislabelling                                   | 17        |
| Emotional support/thanks                       | 15        | Unknown                                        | 12        |
| Information/advice                             | 13        | Lack of context                                | 7         |
| Lack of context                                | 7         | Question                                       | 5         |
| Question                                       | 4         | Non-medical narratives                         | 3         |
| Unknown                                        | 1         | Hypothetical                                   | 1         |
| Empty post                                     | 1         | Empty post                                     | 2         |
| TOTAL                                          | 65        | TOTAL                                          | 47        |

Table 5: Most Important Topics Discussed In Patient Forum Narratives. Topic labels were assigned manually. * Cancer medication

| Topic label | Top 10 words | Top-ranked post for the topic |
| Tumor location | tumor stomach removed liver small cm mitotic metastases rate intestine | 'i only had one tumor on my stomach' |
| (Emotional) Coping | take get time doctor like also know imatinib* day would | 'i completely understand i started 400 imatinib after surgery in and have lots of bad days [...]' |
| Duration of Treatment | years imatinib* almost ago 10 taking two still 11 12 | 'about 1 and 1/2 years' |
| Types of Scans | scan ct pet results next today last showed week cat | 'oops one is a ct scan and one is a pet scan' |
| Diagnosis of GIST | gist diagnosed cancer specialist oncologist husband anyone ago surgeon found | 'that was my gist' |
| Other Medication | sunitinib* regorafenib* sorafenib* imatinib* working 37 exon nilotinib* trial stopped drug | 'i have this on sunitinib' |
| Side Effects | side effects imatinib* effect different fatigue eyes bad 400mg time | 'and no side-effects' |
| Tumor Surgery | surgery remove since weeks first post surgeon second shrink done | 'just had surgery' |
| Absence of Tumor Recurrence | disease evidence still years today post since resection year far | 'no evidence of disease no evidence of disease' |
| Recurrence of Work, Medication or Tumor | back came come hair go went weeks took coming lost | 'i started imatinib after i went back to work' |
| Emotional support | good luck news best far hope bad goes well keep pretty | 'all my best and good luck' |
| Dosage of Medication | mg 400 800 imatinib* 600 take day taking since started | '11 years of imatinib since 2003 at 600 mg and since november 2009 at 800 mg [...]' |
| Timing of Scans | months every scans three ct six year two first month | 'my doctor said 3 years' |
| Ingesting imatinib | one year last took imatinib* day another old got time | 'take imatinib' |

1 https://www.facebook.com/groups/gistsupport/
2 The preprocessing scripts can be found at: https://github.com/AnneDirkson/LexNorm
3 https://bitbucket.org/asarker/simplenormalizerscripts
4 The annotation guidelines can be found at: https://github.com/AnneDirkson/NarrativeFilter
5 The annotated data is available upon request in order to protect the privacy of the patients.
6 The code for unsupervised narrative filtering is shared at: https://github.com/AnneDirkson/NarrativeFilter

Acknowledgements

This work was financed by the SIDN fonds. The authors also thank H. Vos, G. Wiggers, W. Verschoof, A. Brandsen, D. Gawehns, P. Dhar, M. Vinkenoog and G. van Oortmerssen of Leiden University for annotating the data.

References

[AKG08] Ségolène Aymé, Anna Kole, Stephen Groft. Empowerment of patients: lessons from the rare diseases community. The Lancet 371(9629), 2008.

[Bor76] Thomasina Borkman. Experiential Knowledge: A New Concept for the Analysis of Self-Help Groups. Social Service Review 50(3), 1976.

[BY12] Jiang Bian, Fan Yu. Towards Large-scale Twitter Mining for Drug-related Adverse Events. SHB12, 2012.

[CBC+13] Pam Carter, Roger Beech, Domenica Coxon, Martin J. Thomas, Clare Jinks. Mobilising the experiential knowledge of clinicians, patients and carers for applied health-care research. 8, 2013.

[KA18] Payam Karisani, Eugene Agichtein. Did you really just have a heart attack? Proceedings of the 2018 World Wide Web Conference on World Wide Web (WWW 18), 2018.

[LB12] Marco Lui, Timothy Baldwin. langid.py: An Off-the-shelf Language Identification Tool. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 2012.

[LM14] Quoc Le, Tomas Mikolov. Distributed Representations of Sentences and Documents. Proceedings of the 31st International Conference on Machine Learning, 2014.

[LPD13] Alex Lamb, Michael J. Paul, Mark Dredze. Separating Fact from Fear: Tracking Flu Infections on Twitter. Proceedings of NAACL-HLT, 2013.

[Nat] National Library of Medicine.

[NSO+15] Azadeh Nikfarjam, Abeed Sarker, Karen O'Connor, Rachel Ginn, Graciela Gonzalez. Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. Journal of the American Medical Informatics Association: JAMIA 22(3), 2015.

[OGCC15] Derek O'Callaghan, Derek Greene, Joe Carthy, Pádraig Cunningham. An analysis of the coherence of descriptors in topic modeling. Expert Systems with Applications 42(13), 2015.

[Sar17] Abeed Sarker. A customizable pipeline for social media text normalization. Social Network Analysis and Mining 7(45), 2017.

[SG15] Abeed Sarker, Graciela Gonzalez. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. Journal of Biomedical Informatics 53, 2015.

[SGN+15] Abeed Sarker, Rachel Ginn, Azadeh Nikfarjam, Karen O'Connor, Karen Smith, Swetha Jayaraman, Tejaswi Upadhaya, Graciela Gonzalez. Journal of Biomedical Informatics, 2015.

[SHBL16] Edin Smailhodzic, Wyanda Hooijsma, Albert Boonstra, David J. Langley. Social media use in healthcare: A systematic review of effects on patients and on their relationship with healthcare professionals. BMC Health Services Research 16(1), 2016.

[SOG+16] Abeed Sarker, Karen O'Connor, Rachel Ginn, Matthew Scotch, Karen Smith, Dan Malone, Graciela Gonzalez. Drug Safety, 2016.

[TP10] Yla R. Tausczik, James W. Pennebaker. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29(1), 2010.

[VBSEng] Suzan Verberne, Anika Batenburg, Remco Sanders, Mies van Eenbergen. Social processes of online empowerment on a cancer patient discussion forum: using text mining to analyze linguistic patterns of empowerment processes. JMIR Cancer.

[vUKDT+08] Cornelia F. van Uden-Kraan, Constance H. C. Drossaert, Erik Taal, Bret R. Shaw, Erwin R. Seydel, Mart A. F. J. van de Laar. Qualitative Health Research 18, 2008.

[vUKDT+09] Cornelia F. van Uden-Kraan, Constance H. C. Drossaert, Erik Taal, Erwin R. Seydel, Mart A. F. J. van de Laar. 2009.