<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Narrative detection in online patient communities</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Anne</forename><surname>Dirkson</surname></persName>
							<email>a.r.dirkson@liacs.leidenuniv.nl</email>
							<affiliation key="aff0">
								<orgName type="department">Leiden Institute of Advanced Computer Science</orgName>
								<orgName type="institution">Leiden University</orgName>
								<address>
									<addrLine>Niels Bohrweg</addrLine>
									<postCode>2333 CA</postCode>
									<settlement>Leiden</settlement>
									<country key="NL">the Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Suzan</forename><surname>Verberne</surname></persName>
							<email>s.verberne@liacs.leidenuniv.nl</email>
							<affiliation key="aff0">
								<orgName type="department">Leiden Institute of Advanced Computer Science</orgName>
								<orgName type="institution">Leiden University</orgName>
								<address>
									<addrLine>Niels Bohrweg</addrLine>
									<postCode>2333 CA</postCode>
									<settlement>Leiden</settlement>
									<country key="NL">the Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Wessel</forename><surname>Kraaij</surname></persName>
							<email>w.kraaij@liacs.leidenuniv.nl</email>
							<affiliation key="aff0">
								<orgName type="department">Leiden Institute of Advanced Computer Science</orgName>
								<orgName type="institution">Leiden University</orgName>
								<address>
									<addrLine>Niels Bohrweg</addrLine>
									<postCode>2333 CA</postCode>
									<settlement>Leiden</settlement>
									<country key="NL">the Netherlands</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Narrative detection in online patient communities</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">9D6BE848024699CC068B5D2E737B73EC</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T00:57+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Although narratives on patient forums are a valuable source of medical information, their systematic detection and analysis have so far been limited to a single study. In this study, we examine whether psycho-linguistic features or document embeddings can aid the identification of narratives. We also investigate which features distinguish narratives from other social media posts. This study is the first to automatically identify the topics discussed in narratives on a patient forum. Our results show that for classifying narratives, character 3-grams outperform psycho-linguistic features and document embeddings. We found that narratives are characterized by the use of past tense, health-related words and first-person pronouns, whereas non-narrative text is associated with the future tense, emotional support words and second-person pronouns. Topic analysis of the patient narratives uncovered fourteen different medical topics, ranging from tumor surgery to side effects. Future work will use these methods to extract experiential patient knowledge from social media.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Nowadays, online patient forums are the main medium by which patients exchange their narratives. These narratives mainly recount their own experiences with their condition. As such, they contain experiential knowledge <ref type="bibr" target="#b1">[Bor76]</ref>, defined as the knowledge that patients gain from their own experiences. In recent years, such experiential knowledge has increasingly been recognized as valuable and complementary to empirical knowledge [CBC+13]. Consequently, more health-related applications are making use of patient forum data, for instance to track public health trends [SOG+16] and to detect adverse drug responses [SGN+15]. Experiential knowledge is also valuable for patients themselves: patients indicate that they strongly rely on experiences and information provided on patient forums <ref type="bibr" target="#b11">[SHBL16]</ref>. This is especially true for patients with a rare disease, for which medical professionals often lack expertise and the number of studies is limited <ref type="bibr" target="#b0">[AKG08]</ref>.</p><p>To understand the experiential knowledge on patient forums, forum posts that contain narratives must first be identified. As yet, research into systematically distinguishing patient narratives on patient forums is limited to a single study on Dutch forum data <ref type="bibr">[VBSEng]</ref>, which uses words as the only features. We expand upon this work using a different data set by examining whether document embeddings and psycho-linguistic features can improve the identification of patient narratives. We expect so, because these aggregated features are less dependent on individual terms, which may overlap significantly between narratives and factual statements about the same topic. 
Secondly, we explore how narratives differ from other types of posts by studying which features are influential in identifying narratives and which posts are classified incorrectly. Thirdly, we analyze how prevalent narratives are on a cancer patient forum and which topics these narratives discuss.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Narratives on patient forums have mainly been studied qualitatively (e.g. [vUKDT+09]). The automatic identification of narratives on a patient forum is limited to the study by Verberne et al. <ref type="bibr">[VBSEng]</ref> on a Dutch cancer forum. They identified narratives with an F1 of 0.911 using only the lower-cased words of the posts as features. They also found that various linguistic factors (1st person singular, 3rd person and negations) and psychological processes (social processes and religion) were correlated with the presence of narratives. These psycho-linguistic features were measured using the Linguistic Inquiry and Word Count (LIWC) method <ref type="bibr" target="#b12">[TP10]</ref>.</p><p>Additionally, research into self-reported adverse drug responses (ADRs) has led to the development of classifiers for differentiating between factual statements of ADRs and personal experiences of ADRs on social media [BY12, NSO+15, SG15]. However, these classifiers are highly specific and thus not suitable for identifying patient narratives in general.</p><p>Another closely related field is the classification of personal health mentions on social media, i.e. posts that mention a person who is affected as well as their specific condition, such as: 'my granddad has Alzheimer's'. Presently, only two studies have investigated this task. The first, by Lamb et al. <ref type="bibr" target="#b6">[LPD13]</ref>, focused on separating flu awareness from actual flu reports on social media. More recently, Karisani et al. <ref type="bibr" target="#b3">[KA18]</ref> introduced WESPAD, a classifier for personal health mentions, which attains state-of-the-art performance for seven different health domains including stroke, depression and flu infection. 
Nonetheless, a personal health mention alone is not sufficient to consider the post a narrative, and thus these classifiers are also inadequate for our purpose.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Data</head><p>Our data consists of an open, international Facebook forum for patients with Gastrointestinal Stromal Tumor (GIST)<ref type="foot" target="#foot_0">1</ref>. It is moderated by GIST Support International and consists of 36,722 posts with a median length of 20 tokens.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Preprocessing</head><p>The data was lowercased and tokenized with NLTK. Due to the noisy nature of user-generated content, especially in the spelling of medical terms, we applied a tailored preprocessing pipeline<ref type="foot" target="#foot_1">2</ref> to our data. First, an existing normalization pipeline for social media <ref type="bibr" target="#b9">[Sar17]</ref> <ref type="foot" target="#foot_2">3</ref> was used to normalize tokens to American English and to expand generic abbreviations used on social media. Next, domain-specific abbreviations were expanded with a lexicon of 42 non-ambiguous abbreviations, generated based on 1000 posts and annotated by a domain expert and the first author. Spelling mistakes were detected using a combination of relative frequency and edit distance to possible candidates and corrected using weighted Levenshtein distance. Correction candidates were derived from the corpus itself. Drug names were normalized using the RxNorm database <ref type="bibr">[Nat]</ref>. Non-English posts were removed using langid <ref type="bibr" target="#b4">[LB12]</ref>. Punctuation was removed, but stop words were not, as we expect function words to play a role in the expression of narratives.</p></div>
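<div xmlns="http://www.tei-c.org/ns/1.0"><p>The spelling-correction step can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses plain (unweighted) Levenshtein distance instead of the weighted variant, and the function names and the thresholds max_distance and min_ratio are assumptions chosen only for illustration. As in the text, correction candidates come from the corpus itself, and a token is replaced only by a candidate that is both close in spelling and much more frequent.</p>
<p>
```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Plain (unweighted) Levenshtein distance via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def correct_token(token, counts, max_distance=2, min_ratio=10.0):
    """Return the closest, most frequent corpus token, or the token itself.

    max_distance and min_ratio are hypothetical thresholds, not the
    values used in the paper.
    """
    best, best_key = token, None
    for candidate, frequency in counts.items():
        if candidate == token:
            continue
        distance = levenshtein(token, candidate)
        # A candidate must be close in spelling and much more frequent.
        if distance > max_distance or min_ratio * counts[token] > frequency:
            continue
        key = (distance, -frequency)
        if best_key is None or best_key > key:
            best, best_key = candidate, key
    return best


# Toy corpus: one misspelling of a frequent drug name.
corpus = ["imatinib"] * 50 + ["imatanib"] + ["tumor"] * 30 + ["surgery"] * 20
counts = Counter(corpus)
print(correct_token("imatanib", counts))  # imatinib
print(correct_token("tumor", counts))     # tumor
```
</p>
<p>A weighted variant would replace the unit costs in levenshtein with per-operation or per-character costs; the costs used by the authors are not stated in this section.</p></div>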
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Supervised classification</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1">Manual annotation of example data</head><p>We randomly selected 1050 posts for annotation. The annotators were asked to indicate per message whether it contains a personal experience. They were not provided with its context. Personal experiences did not need to be about the author but could be about someone else. This definition was based on earlier work by Verberne et al. <ref type="bibr">[VBSEng]</ref> and van Uden-Kraan et al. [vUKDT+08]. The first 50 posts were annotated individually by the first author and another PhD student to improve the annotation guidelines.<ref type="foot" target="#foot_3">4</ref> The remaining 1000 posts were divided equally into six sets of 200 posts, with 40 posts (20%) overlapping between all sets. The overlap was used to calculate the pairwise Cohen's kappa. There were seven annotators in total: six PhD students and one GIST patient. Each sample was assigned to an annotator, apart from one sample which was divided between two PhD students. To be able to include the overlapping sample in the classification, we opted to use the annotations of the GIST patient for these 40 posts.<ref type="foot" target="#foot_4">5</ref></p></div>
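<div xmlns="http://www.tei-c.org/ns/1.0"><p>For reference, pairwise Cohen's kappa on an overlapping sample can be computed as in the sketch below. The label lists are hypothetical and do not come from the study; the function name is likewise chosen for illustration.</p>
<p>
```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label at random.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for ten overlapping posts (1 = narrative, 0 = not).
annotator_1 = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.583
```
</p>
<p>Here the observed agreement is 0.8 and the chance agreement is 0.52, so kappa is 0.28 / 0.48. With several annotators, the same function would be applied to each pair and the results averaged.</p></div>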
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.2">Feature sets</head><p>Four feature sets were derived from the text data: word unigrams, character n-grams (using the CountVectorizer function in sklearn), psycho-linguistic features, and document embeddings. For both word unigrams and character n-grams, we investigated whether TF-IDF weighting would improve performance compared to raw counts. Additionally, we explored whether stemming or lemmatising the data prior to extracting the unigrams could improve performance. Psycho-linguistic features were based on LIWC 2015 <ref type="bibr" target="#b12">[TP10]</ref>. Punctuation categories were discarded, resulting in 82 LIWC features in total. LIWC is a well-known method for investigating psychological processes in text and includes both linguistic (e.g. first-person pronouns) and psychological categories (e.g. positive emotions). The last feature set consisted of document embeddings: a doc2vec model <ref type="bibr" target="#b5">[LM14]</ref> was trained on the labeled training data for each fold in the cross-validation. We combined a distributed memory model with a distributed bag of words model, as recommended by Le and Mikolov <ref type="bibr" target="#b5">[LM14]</ref>. We also attempted to train document embeddings first on the unsupervised data and then re-train on the supervised data, but this led to nonsensical classification features.</p></div>
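<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the character n-gram features concrete, the following standard-library sketch counts overlapping character 3-grams and projects them onto a shared vocabulary. The paper itself used sklearn's CountVectorizer; this hand-rolled version only approximates it (for example, it does no whitespace normalisation), and the helper names and example posts are assumptions for illustration.</p>
<p>
```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a lowercased post."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def to_vector(counts: Counter, vocabulary: list) -> list:
    """Project n-gram counts onto a fixed, ordered vocabulary."""
    return [counts.get(gram, 0) for gram in vocabulary]


# Hypothetical posts, already lowercased and stripped of punctuation.
posts = ["i had surgery", "you will be fine"]
per_post = [char_ngrams(post) for post in posts]
vocabulary = sorted(set().union(*per_post))
matrix = [to_vector(counts, vocabulary) for counts in per_post]

print(per_post[0]["had"])  # 1
print(len(vocabulary), len(matrix))
```
</p>
<p>Because n-grams span spaces, function words such as "had" and word endings such as "nib" become features of their own, which is why stop words are kept during preprocessing.</p></div>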
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.3">Supervised classification algorithms</head><p>Classifiers were evaluated separately for each feature set. We ignored all posts that had been left empty by the annotator (the annotator chose neither yes nor no): three posts were ignored for this reason. For word unigrams, character n-grams and psycho-linguistic features, we compared four sklearn classification algorithms: Multinomial Naive Bayes (MNB), linear Support Vector Classification (LinearSVC), Stochastic Gradient Descent (SGD) with log loss, and K Nearest Neighbours (KNN). These were chosen according to the following criteria: (1) known to perform well on text data, (2) recommended for small data sets and (3) able to calculate probabilistic outcomes. The latter enabled us to use probabilistic ensembles. The doc2vec representations combined with Logistic Regression were used as a classifier in their own right: the document representations were tagged with the labels of the training data. This model was then used to derive vector representations for new documents. To test if a combination of feature types could improve performance, we evaluated soft voting (argmax of the sums of the predicted probabilities) of the best individual classifiers for the best performing variants of each feature set. Significance testing was done with pair-wise t-tests.</p><p>To evaluate the performance, the average F1 score of a 10-fold cross validation was used. For each run, hyper-parameters were tuned for that specific training set using a 10-fold grid search on the training data. The tuning grids were based on sklearn documentation: C from 10^-3 to 10^3 (steps of ×10) for LinearSVC and Logistic Regression; number of neighbors from 3 to 11 (steps of 2) for KNN; and max iterations from 2 to 2048 (steps of ×2) and alpha from 10^-8 to 10^-2 (steps of ×10) for SGD. The dimensionality of the document vectors was tuned with a grid of 100 to 400 (steps of 100).</p></div>
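<div xmlns="http://www.tei-c.org/ns/1.0"><p>Soft voting as described above (argmax of the summed class probabilities) and the tuning grids can be written out as a short sketch. The probability values and all variable names are hypothetical; only the grid ranges are taken from the text.</p>
<p>
```python
def soft_vote(probabilities_per_classifier):
    """Soft voting: sum class probabilities across classifiers, take argmax.

    probabilities_per_classifier[k][i][c] is the probability that
    classifier k assigns to class c for post i.
    """
    n_posts = len(probabilities_per_classifier[0])
    n_classes = len(probabilities_per_classifier[0][0])
    predictions = []
    for i in range(n_posts):
        summed = [
            sum(clf[i][c] for clf in probabilities_per_classifier)
            for c in range(n_classes)
        ]
        predictions.append(max(range(n_classes), key=summed.__getitem__))
    return predictions


# Hypothetical probabilities for two posts; classes are [non-narrative, narrative].
char_3grams = [[0.30, 0.70], [0.55, 0.45]]
liwc = [[0.60, 0.40], [0.70, 0.30]]
doc2vec = [[0.35, 0.65], [0.40, 0.60]]
print(soft_vote([char_3grams, liwc, doc2vec]))  # [1, 0]

# Tuning grids with the ranges given in the text.
c_grid = [10.0 ** e for e in range(-3, 4)]       # LinearSVC, Logistic Regression
alpha_grid = [10.0 ** e for e in range(-8, -1)]  # SGD
max_iter_grid = [2 ** e for e in range(1, 12)]   # SGD, 2 up to 2048
neighbors_grid = list(range(3, 12, 2))           # KNN
vector_size_grid = list(range(100, 500, 100))    # doc2vec dimensionality
```
</p>
<p>In the first post the narrative class wins with a summed probability of 1.75 against 1.25, even though one classifier disagrees; in the second post the non-narrative class wins with 1.65 against 1.35.</p></div>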
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Topic modelling of the whole data set</head><p>To label the remaining data, the best performing classifier was used with the hyper-parameter settings that were optimal in the majority of the training sets. To investigate which topics are discussed in the patient narratives, we used topic modelling with Non-negative Matrix Factorization (NMF) of the TF-IDF weighted tokens without stopwords. Topic coherence, measured using TC-W2V <ref type="bibr" target="#b8">[OGCC15]</ref>, was used to select the number of topics. Topic labels were assigned manually by exploring the words with the highest weights and the top-ranked (i.e. most relevant) messages per topic.</p></div>
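<div xmlns="http://www.tei-c.org/ns/1.0"><p>The coherence-based choice of the number of topics can be sketched as follows. This assumes the usual reading of TC-W2V, namely the mean pairwise cosine similarity between the word embeddings of a topic's top terms, averaged over all topics of a model. The two-dimensional vectors below are toy stand-ins for word2vec embeddings, and all names are hypothetical.</p>
<p>
```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topic_coherence(top_terms, vectors):
    """Mean pairwise cosine similarity of one topic's top terms."""
    pairs = list(combinations(top_terms, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def model_coherence(topics, vectors):
    """Average the per-topic coherence over all topics of one model."""
    return sum(topic_coherence(terms, vectors) for terms in topics) / len(topics)


# Toy 2-d vectors standing in for word2vec embeddings (hypothetical values).
vectors = {
    "scan": [1.0, 0.1], "ct": [0.9, 0.2], "pet": [0.95, 0.15],
    "luck": [0.1, 1.0], "hope": [0.2, 0.9],
}
coherent = [["scan", "ct", "pet"], ["luck", "hope"]]
mixed = [["scan", "luck", "pet"], ["ct", "hope"]]
print(model_coherence(coherent, vectors) > model_coherence(mixed, vectors))  # True
```
</p>
<p>To select the number of topics, one would fit a topic model for each candidate number, compute model_coherence on the top terms of each model, and keep the number with the highest score.</p></div>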
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Annotated data</head><p>The data was slightly imbalanced, with 37.7% of the posts containing a narrative, resulting in a majority baseline of roughly 0.62. The inter-annotator agreement was substantial (κ = 0.69).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Classifier evaluation</head><p>A Linear SVC on character 3-grams achieves the highest F1 score (Table <ref type="table" target="#tab_0">1</ref>), although character 4-grams (p = 0.526), stemmed unigrams (p = 0.930) and lemmatised unigrams (p = 0.587) do not perform significantly worse. Character 5- and 6-grams also do not perform significantly worse overall (p = 0.122 and p = 0.169), but their recall is significantly lower (p = 0.023 and p = 0.029). The classifiers for the best performing document embeddings (DM+DBOW) and psycho-linguistic features, however, are significantly worse overall than character 3-grams (p = 0.0055 and p = 0.026 respectively). Employing TF-IDF weighting does not aid any of the unigram or character n-gram features. Additionally, neither feature selection (F1 = 0.761) nor word boundaries (F1 = 0.796) improve the performance of character 3-grams. Using a range of character n-grams, namely 3-to-4 (F1 = 0.814), 3-to-5 (F1 = 0.814), or 3-to-6 (F1 = 0.812), also does not boost performance.</p><p>Ensemble classification did not perform better than character 3-grams alone (see Table <ref type="table" target="#tab_1">2</ref>). Nevertheless, an ensemble of all four feature types is significantly more precise than all other classifiers (p = 0.0048 compared to the second best). To further explore why ensemble classification does not manage to improve overall performance, we investigated the predictions of individual classifiers. As can be seen in Table <ref type="table" target="#tab_2">3</ref>, there is a high degree of overlap between the predictions based on character 3-grams and the other feature sets (88.3%, 83.8% and 84.4% respectively). Consequently, the vast majority of the predictions cannot be improved by complementing character 3-grams with these feature sets. Interestingly, 4.7% of the posts are misclassified by all feature sets. 
Considering the non-overlapping predictions, the percentage of correct predictions was higher for character 3-grams than for either document embeddings or psycho-linguistic features in a pair-wise comparison. Thus, it appears that adding these features would be more detrimental than beneficial to narrative classification.    </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Influential features</head><p>Narratives are typically distinguished by terms relating to the past tense (was, had, years), health (imatinib, tumor, surgeri) and first-person narrative (my, i) (see Figure <ref type="figure" target="#fig_0">1</ref>). This is corroborated by the character 3-grams, psycho-linguistic features and document embeddings. Some of the important terms for non-narrative texts are also health-related (patients, gist) and first-person narrative (we, us), which showcases the difficulty of the task at hand. In general, non-narrative texts seem to focus more on emotional support (prayer, share, may), second-person narrative (you, your) and the future (may, will). The psycho-linguistic features additionally reveal that narratives contain more mentions of causality and negative emotions. In contrast, non-narrative texts seem to contain more positive emotions. Lastly, as predicted, function words appear important for classifying narratives in social media, and it is thus advisable not to remove stopwords.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Error analysis for the best performing classifier</head><p>Error analysis reveals that a substantial proportion of the errors is due to incorrect annotation: 36.9% of the false positives and 36.2% of the false negatives were labelled incorrectly (see Table <ref type="table" target="#tab_3">4</ref>). Specifically, annotators have difficulty correctly labelling discussions about personal medical facts or side effects as narratives (e.g. 'i have been on imatinib 5 months and lost 1/3 of my hair'). Conversely, annotators may incorrectly judge posts that give emotional support, external information or advice to be narratives while they are not (e.g. 'i may be wrong but total gastrectomy sounds very extreme for two small gist').</p><p>The incorrect labelling may have impacted the automated classification such that these categories are also more difficult for the computer to distinguish. The classifier does, however, appear to outperform human judgement and to some extent 'correct' the annotators' mistakes. In fact, its performance may be underestimated by the metrics based on these incorrect labels. Other types of posts that appear challenging for the computer are posts that lack context or contain questions. The former are often answers to unknown questions posed earlier in the thread.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Frequency and content of patient narratives</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.1">Automated narrative detection in unsupervised data</head><p>The percentage of narratives in the unlabelled data is 37.0%, which is comparable to the annotated sample. This results in a total of 13,436 posts for topic modelling.<ref type="foot" target="#foot_5">6</ref></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.2">Topic modelling</head><p>The TC-W2V metric <ref type="bibr" target="#b8">[OGCC15]</ref> identifies the optimal number of topics to be fourteen. The resulting topics relate to different aspects of the medical process for GIST patients (see Table <ref type="table" target="#tab_4">5</ref>). Note that imatinib is the most commonly used medication.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Discussion</head><p>Narratives were detected best when using character 3-grams. Their strength is in their ability to cluster relevant word types based on suffixes and prefixes. This is especially relevant in the medical domain, e.g. all cancer medication for GIST ends in 'nib'. In contrast, psycho-linguistic features appear to suffer from oversimplification, because they aggregate words that define different classes into one category, e.g. we and my into the umbrella category of first-person pronouns (see Figure <ref type="figure" target="#fig_0">1</ref>). The use of document embeddings may have been hampered by the small size of the data. An alternative explanation could be that incorrect labelling impacts these features more strongly than word-based features.</p><p>Narratives could be differentiated most strongly by their use of past tense, first-person narrative and health-related words. The first two are in line with the linguistic definition of a narrative. The stronger focus on health, however, may indicate that patients prefer to share their own health experiences rather than health information from external sources.</p><p>Annotating narratives appears to be a challenging task, despite providing annotators with a guideline based on previous work <ref type="bibr">[VBSEng]</ref> and validated through initial annotation by two annotators. This is underscored by our inter-annotator agreement (κ = 0.69), which was comparable to that of Verberne et al. <ref type="bibr">[VBSEng]</ref> (κ = 0.71). Our classifier performed less well than their system (F1 = 0.91), which may be explained by their larger sample of annotated data (2,051 posts).</p><p>Inevitably, our results depend on the choice of what constitutes a narrative and how annotators interpret this definition. It appears that especially the line between a medical fact about oneself and a medical experience is fuzzy for annotators. 
Future studies could perhaps use this knowledge to develop clearer guidelines. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>For the detection of patient narratives on social media, psycho-linguistic features and document embeddings are outperformed by character 3-grams. These narratives are associated with the past tense, health and first-person pronouns, whereas non-narrative text is associated with the future tense, emotional support and second-person pronouns. The patient narratives could be subdivided into discussions of fourteen different medical topics, ranging from surgery to side effects. Future work will develop automated methods for the extraction of patient knowledge from the narratives.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The 20 Most Influential Features In Individual Classifiers. In (b) underscores represent spaces.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Mean Test Score (10-fold CV) For Best Classifiers Per Feature Set</figDesc><table><row><cell>Feature set</cell><cell></cell><cell>Size</cell><cell>Classifier</cell><cell>F1 (SD)</cell><cell>Recall (SD)</cell><cell>Precision (SD)</cell></row><row><cell>Unigrams</cell><cell>Original</cell><cell>4,078</cell><cell>SGD</cell><cell>0.795 (0.025)</cell><cell>0.788 (0.074)</cell><cell>0.811 (0.055)</cell></row><row><cell></cell><cell>Stemmed</cell><cell>3,205</cell><cell>SGD</cell><cell>0.814 (0.031)</cell><cell>0.793 (0.047)</cell><cell>0.840 (0.049)</cell></row><row><cell></cell><cell>Lemmatised</cell><cell>3,777</cell><cell>SGD</cell><cell>0.808 (0.039)</cell><cell>0.810 (0.059)</cell><cell>0.813 (0.070)</cell></row><row><cell>Character n-grams</cell><cell>3-grams</cell><cell>5,086</cell><cell>SVC</cell><cell>0.815 (0.035)</cell><cell>0.844 (0.047)</cell><cell>0.793 (0.058)</cell></row><row><cell></cell><cell>4-grams</cell><cell>16,496</cell><cell>SVC</cell><cell>0.811 (0.027)</cell><cell>0.827 (0.068)</cell><cell>0.844 (0.029)</cell></row><row><cell></cell><cell>5-grams</cell><cell>36,349</cell><cell>SGD/SVC</cell><cell>0.796 (0.023)</cell><cell>0.784 (0.059)</cell><cell>0.817 (0.069)</cell></row><row><cell></cell><cell>6-grams</cell><cell>60,443</cell><cell>SGD</cell><cell>0.793 (0.040)</cell><cell>0.797 (0.042)</cell><cell>0.795 (0.079)</cell></row><row><cell>LIWC</cell><cell></cell><cell>82</cell><cell>SVC</cell><cell>0.773 (0.031)</cell><cell>0.805 (0.044)</cell><cell>0.752 (0.077)</cell></row><row><cell>Doc2vec</cell><cell>DBOW</cell><cell>400</cell><cell>LogReg</cell><cell>0.737 (0.029)</cell><cell>0.751 (0.056)</cell><cell>0.735 (0.066)</cell></row><row><cell></cell><cell>DM</cell><cell>400</cell><cell>LogReg</cell><cell>0.762 (0.039)</cell><cell>0.749 (0.062)</cell><cell>0.785 (0.070)</cell></row><row><cell></cell><cell>DM+DBOW</cell><cell>800</cell><cell>LogReg</cell><cell>0.772 (0.037)</cell><cell>0.803 (0.064)</cell><cell>0.749 (0.055)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Mean Test Score (10-fold CV) For Ensemble Classification. * DM+DBOW variant.</figDesc><table><row><cell>Feature sets</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Comparison of Predictions of Classifiers for Different Feature Sets. * DM+DBOW variant.</figDesc><table><row><cell>Both</cell><cell>Difference</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 :</head><label>4</label><figDesc>Error Analysis for best classifier (Character 3-gram Classification of Narratives)</figDesc><table><row><cell>False positives</cell><cell></cell><cell>False negatives</cell><cell></cell></row><row><cell>Reasons for misclassification</cell><cell>Frequency</cell><cell>Reasons for misclassification</cell><cell>Frequency</cell></row><row><cell>Mislabelling</cell><cell>24</cell><cell>Mislabelling</cell><cell>17</cell></row><row><cell>Emotional support/thanks</cell><cell>15</cell><cell>Unknown</cell><cell>12</cell></row><row><cell>Information/advice</cell><cell>13</cell><cell>Lack of context</cell><cell>7</cell></row><row><cell>Lack of context</cell><cell>7</cell><cell>Question</cell><cell>5</cell></row><row><cell>Question</cell><cell>4</cell><cell>Non-medical narratives</cell><cell>3</cell></row><row><cell>Unknown</cell><cell>1</cell><cell>Hypothetical</cell><cell>1</cell></row><row><cell>Empty post</cell><cell>1</cell><cell>Empty post</cell><cell>2</cell></row><row><cell>TOTAL</cell><cell>65</cell><cell>TOTAL</cell><cell>47</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5 :</head><label>5</label><figDesc>Most Important Topics Discussed In Patient Forum Narratives. Topic labels were assigned manually. * Cancer medication</figDesc><table><row><cell>Topic labels</cell><cell>Top 10 words</cell><cell>Top-ranked post for the topic</cell></row><row><cell>Tumor location</cell><cell>tumor stomach removed liver small cm mitotic metastases rate intestine</cell><cell>'i only had one tumor on my stomach'</cell></row><row><cell>(Emotional) Coping</cell><cell>take get time doctor like also know imatinib* day would</cell><cell>'i completely understand i started 400 imatinib after surgery in and have lots of bad days [...]'</cell></row><row><cell>Duration of Treatment</cell><cell>years imatinib* almost ago 10 taking two still 11 12</cell><cell>'about 1 and 1/2 years'</cell></row><row><cell>Types of Scans</cell><cell>scan ct pet results next today last showed week cat</cell><cell>'oops one is a ct scan and one is a pet scan'</cell></row><row><cell>Diagnosis of GIST</cell><cell>gist diagnosed cancer specialist oncologist husband anyone ago surgeon found</cell><cell>'that was my gist'</cell></row><row><cell>Other Medication</cell><cell>sunitinib* regorafenib* sorafenib* imatinib* working 37 exon nilotinib* trial stopped drug</cell><cell>'i have this on sunitinib'</cell></row><row><cell>Side Effects</cell><cell>side effects imatinib* effect different fatigue eyes bad 400mg time</cell><cell>'and no side-effects'</cell></row><row><cell>Tumor Surgery</cell><cell>surgery remove since weeks first post surgeon second shrink done</cell><cell>'just had surgery'</cell></row><row><cell>Absence of Tumor Recurrence</cell><cell>disease evidence still years today post since resection year far</cell><cell>'no evidence of disease no evidence of disease'</cell></row><row><cell>Recurrence of Work, Medication or Tumor</cell><cell>back came come hair go went weeks took coming lost</cell><cell>'i started imatinib after i went back to work'</cell></row><row><cell>Emotional support</cell><cell>good luck news best far hope bad goes well keep pretty</cell><cell>'all my best and good luck'</cell></row><row><cell>Dosage of Medication</cell><cell>mg 400 800 imatinib* 600 take day taking since started</cell><cell>'11 years of imatinib since 2003 at 600 mg and since november 2009 at 800 mg [...]'</cell></row><row><cell>Timing of Scans</cell><cell>months every scans three ct six year two first month</cell><cell>'my doctor said 3 years'</cell></row><row><cell>Ingesting imatinib</cell><cell>one year last took imatinib* day another old got time</cell><cell>'take imatinib'</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://www.facebook.com/groups/gistsupport/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">The preprocessing scripts can be found at: https://github.com/AnneDirkson/LexNorm</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://bitbucket.org/asarker/simplenormalizerscripts</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">The annotation guidelines can be found at: https://github.com/AnneDirkson/NarrativeFilter</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">To protect the privacy of the patients, the annotated data are available upon request.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">The code for unsupervised narrative filtering is shared at: https://github.com/AnneDirkson/NarrativeFilter</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Acknowledgements</head><p>This work was financed by the SIDN fonds. The authors also thank H. Vos, G. Wiggers, W. Verschoof, A. Brandsen, D. Gawehns, P. Dhar, M. Vinkenoog and G. van Oortmerssen of Leiden University for annotating the data.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Empowerment of patients: lessons from the rare diseases community</title>
		<author>
			<persName><forename type="first">Ségolène</forename><surname>Aymé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anna</forename><surname>Kole</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stephen</forename><surname>Groft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Lancet</title>
		<imprint>
			<biblScope unit="volume">371</biblScope>
			<biblScope unit="issue">9629</biblScope>
			<biblScope unit="page" from="2048" to="2051" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Experiential Knowledge: A New Concept for the Analysis of Self-Help Groups</title>
		<author>
			<persName><forename type="first">Thomasina</forename><surname>Borkman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Social Service Review</title>
		<imprint>
			<biblScope unit="volume">50</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="445" to="456" />
			<date type="published" when="1976">1976</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Mobilising the experiential knowledge of clinicians, patients and carers for applied health-care research</title>
		<author>
			<persName><forename type="first">Jiang</forename><surname>Bian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fan</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pam</forename><surname>Carter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Roger</forename><surname>Beech</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Domenica</forename><surname>Coxon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Martin</forename><forename type="middle">J</forename><surname>Thomas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Clare</forename><surname>Jinks</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SHB12</title>
				<imprint>
			<date type="published" when="2012">2012. 2013</date>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="307" to="320" />
		</imprint>
	</monogr>
	<note>Towards Large-scale Twitter Mining for Drug-related Adverse Events</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Did you really just have a heart attack?</title>
		<author>
			<persName><forename type="first">Payam</forename><surname>Karisani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eugene</forename><surname>Agichtein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 World Wide Web Conference on World Wide Web - WWW '18</title>
				<meeting>the 2018 World Wide Web Conference on World Wide Web - WWW '18</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">langid.py: An Off-the-shelf Language Identification Tool</title>
		<author>
			<persName><forename type="first">Marco</forename><surname>Lui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Timothy</forename><surname>Baldwin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 50th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="25" to="30" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Distributed Representations of Sentences and Documents</title>
		<author>
			<persName><forename type="first">Quoc</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tomas</forename><surname>Mikolov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 31st International Conference on Machine Learning</title>
				<meeting>the 31st International Conference on Machine Learning</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Separating Fact from Fear: Tracking Flu Infections on Twitter</title>
		<author>
			<persName><forename type="first">Alex</forename><surname>Lamb</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><forename type="middle">J</forename><surname>Paul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mark</forename><surname>Dredze</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of NAACL-HLT</title>
				<meeting>NAACL-HLT</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="789" to="795" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features</title>
		<author>
			<persName><forename type="first">Azadeh</forename><surname>Nikfarjam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Abeed</forename><surname>Sarker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Karen</forename><surname>O'Connor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rachel</forename><surname>Ginn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Graciela</forename><surname>Gonzalez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Medical Informatics Association: JAMIA</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="671" to="681" />
			<date type="published" when="2015">2015</date>
		</imprint>
		<respStmt>
			<orgName>National Library of Medicine</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">An analysis of the coherence of descriptors in topic modeling</title>
		<author>
			<persName><forename type="first">Derek</forename><surname>O'Callaghan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Derek</forename><surname>Greene</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Joe</forename><surname>Carthy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pádraig</forename><surname>Cunningham</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Expert Systems with Applications</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="issue">13</biblScope>
			<biblScope unit="page" from="5645" to="5657" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A customizable pipeline for social media text normalization</title>
		<author>
			<persName><forename type="first">Abeed</forename><surname>Sarker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Social Network Analysis and Mining</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">45</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Portable automatic text classification for adverse drug reaction detection via multi-corpus training</title>
		<author>
			<persName><forename type="first">Abeed</forename><surname>Sarker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Graciela</forename><surname>Gonzalez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Abeed</forename><surname>Sarker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rachel</forename><surname>Ginn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Azadeh</forename><surname>Nikfarjam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Karen</forename><surname>O'Connor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Karen</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Swetha</forename><surname>Jayaraman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tejaswi</forename><surname>Upadhaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Graciela</forename><surname>Gonzalez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Biomedical Informatics</title>
		<imprint>
			<biblScope unit="volume">53</biblScope>
			<biblScope unit="page" from="202" to="212" />
			<date type="published" when="2015">2015. 2015</date>
		</imprint>
	</monogr>
	<note>Journal of Biomedical Informatics</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Social media use in healthcare: A systematic review of effects on patients and on their relationship with healthcare professionals</title>
		<author>
			<persName><forename type="first">Edin</forename><surname>Smailhodzic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wyanda</forename><surname>Hooijsma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Albert</forename><surname>Boonstra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><forename type="middle">J</forename><surname>Langley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Abeed</forename><surname>Sarker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Karen</forename><surname>O'Connor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rachel</forename><surname>Ginn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matthew</forename><surname>Scotch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Karen</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dan</forename><surname>Malone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Graciela</forename><surname>Gonzalez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC Health Services Research</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="231" to="240" />
			<date type="published" when="2016">2016. 2016</date>
		</imprint>
	</monogr>
	<note>Drug Safety</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">The psychological meaning of words: LIWC and computerized text analysis methods</title>
		<author>
			<persName><forename type="first">Yla</forename><forename type="middle">R</forename><surname>Tausczik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">James</forename><forename type="middle">W</forename><surname>Pennebaker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Language and Social Psychology</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="24" to="54" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Social processes of online empowerment on a cancer patient discussion forum: using text mining to analyze linguistic patterns of empowerment processes</title>
		<author>
			<persName><forename type="first">Suzan</forename><surname>Verberne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anika</forename><surname>Batenburg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Remco</forename><surname>Sanders</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mies</forename><surname>Van Eenbergen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Cornelia</forename><forename type="middle">F</forename><surname>Van Uden-Kraan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Constance</forename><forename type="middle">H</forename><surname>Drossaert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Erik</forename><surname>Taal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bret</forename><forename type="middle">R</forename><surname>Shaw</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Erwin</forename><forename type="middle">R</forename><surname>Seydel</surname></persName>
		</author>
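		<author>
			<persName><forename type="first">Mart</forename><forename type="middle">A F J</forename><surname>Van De Laar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Cornelia</forename><forename type="middle">F</forename><surname>Van Uden-Kraan</surname></persName>
		</author>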
		<author>
			<persName><forename type="first">Constance</forename><forename type="middle">H C</forename><surname>Drossaert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Erik</forename><surname>Taal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Erwin</forename><forename type="middle">R</forename><surname>Seydel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mart</forename><forename type="middle">A F J</forename><surname>Van De Laar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">JMIR Cancer</title>
				<imprint>
			<date type="published" when="2008">2008. 2009</date>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="page" from="61" to="69" />
		</imprint>
	</monogr>
	<note>Qualitative Health Research</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
