499 Information Labelling of Medical Forum Posts by Non-Clinical Text Information Retrieval Amit Kumar Kushwaha, Arpan Kumar Kar Indian Institute of Technology Delhi, New Delhi, India Abstract With the advent of web 2.0, modern societies produce a vast amount of data, and merely keeping up with storage and transmission is difficult; analyzing it to extract useful information has become further challenging. All the historical research in healthcare data processing is more concentrated on formal clinical data. There lies a lot of valuable yet idle lying data in the non- clinical information as well. The proposed study combines the state of the art methods within distributed computing, text retrieval, clustering methods, and finally, using a classification method to a computationally efficient system that can clarify cancer patient trajectories based on non-clinical and freely available online forum posts. The motivation is that informed patients, caretakers, and relatives often lead to better overall treatment outcomes due to enhanced possibilities of proper disease management. The resulting software prototype is fully functional and built to serve as a test bench for various text information retrieval and visualization methods. Via the prototype, we demonstrate a computationally efficient clustering of posts into cancer-types and subsequent within-cluster classification into trajectory related classes. The system also provides an interactive graphical user interface allowing end-users to mine and oversee the valuable information. Keywords 1 Machine Learning, Chatbot, Artificial Intelligence, Medical, Ontology 1. Introduction medicinal side effects, treatment plans and costs at each stage, any other non-documented side effects, palliative care, and many more. Most of the patients that acquire a Mis-informed outputs can lead to costlier progressive and terminal disease are towards yet un-successful and delayed treatment. the latter end of life. 
Some of these illnesses are Scholarly outputs, in turn, can have clarified primarily respiratory disorders, cancers, and trajectories of the timeframe and can lead to cardiovascular. This illness implies a large time better overall treatment owing to better clinical frame for the patients themselves and the sources and decisions. This further reduces the surrounding relatives and caretakers [1], [2]. possibilities of fewer re-admissions, decreased The trajectory of the timeframe can be health care costs, and higher quality of life for summarized as a sequence of steps shown in patients in the potentially final weeks, months, figure 1. Although the entire trajectory looks and years. Unlimitedly, better overall care is very simple and compact, they are complex and obtainable via clarification during early stages, contain a range of concerns underneath each of estimation, and communication of patient- the four steps. For instance: life expectancy at specific symptoms and disease trajectories. each stage, patterns of decline, probable interactions with other health services, ISIC’21: International Semantic Intelligence Conference, February 25-27, 2021, Delhi, India EMAIL:Kushwaha.amitkumar@gmail.com; Arpan_kar@yahoo.co.in ORCID: 0000-0002-5537-1250; 0000-0003-4186-4887 ©️ 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) 500 The proposed study is motivated by the idea journals that capture doctors' comments, of exploiting the relevant yet idle information medical reports, and similarly discharge in the ever-increasing user-generated content summaries. 
In all these formal EHRs, the through online and freely accessible non- language of cause, symptoms, cures, and after- clinical text for the benefit of anyone interested effects is more concise, specific, and medical in any clinical trajectory, e.g., cancer patients, terms are used more distinctly from case to COVID-19 patients. With the recent increase case. These terms are way different from a and rise in the overall COVID-19, there was layperson's mention of terms in the same massive unrest during the initial stages of the context on the online forums. This adds to our disease spread, where even the patients who motivation to make this non-clinical data were tested positive were not sure about the available for a person in general. trajectories of the sequence of steps in figure 1. The current research objective is to clarify This motivated me to further this study, which and communicate the patient trajectories at can be an essential literature contribution for each stage by computationally efficient text researchers and act as a day-to-day practice information retrieval from non-clinical online implication for someone who has internet but forum post texts. Through the current study, the cannot navigate through much-unstructured to identified research objectives are met by find simple, relevant clinical information. building a fully functional and generalizable Historical data shows that approximately framework that can screen/filter, process, and one-third of the entire world's population gets present the non-clinical data for clinical diagnosed with cancer during their lifetime [3]. trajectory in a visually and informative way. According to the World Health Organization The framework is chosen to act as a test bench (WHO) [4], as of 9th August 20 globally, there for future text information retrieval methods have 20 million patients tested positive for and is not only restricted to the current study. COVID-19. 
Thus, a large community of The current study's underlying premise is potential end-users can consume the non- unstructured inherent and valuable information, clinical data for answering their queries related which is freely available on non-clinical yet to clinical trajectories. A cancer diagnosis or in medical forums. recent times COVID-19 leads to several reactions, a predominantly one is first sought 2. Related work information online on specific symptom, type and severity and finally trajectory prognosis. Scholars' research with the objectives, A trend that has recently gained prominence methods, and hypothesis rooted in data mining among the community is to communicate on has been mostly focusing on text online forums [5], [6], [7], [8], [9], [10]. On summarization. In 2005, Murray et al. [2] these medical forums, people have the right to performed the clinical review research that write freely on their emotions and what they summarizes three disease trajectories: organ feel about the disease, treatment, after-effects, failure (heart and long), frail elderly, and and normalcy after the treatment without cancer. In another related study in 2010, disclosing the identity. For instance: on cancer Ebadollahi et al. [11] predicted a patient's forums, people write freely about their initial trajectory from temporal physiological data. stage frustrations, fears, and how they This study was further improved in a 2014 overcame them. The same applies to COVID- research undertaken by Jensen et al. [12] with 19 forums too. Any healthcare system does not the disease trajectory data spanning fifteen leverage this freely available non-clinical, years from a large patient population. nonetheless less potentially very relevant In 2016, Ji et al.[13] proposed a predictive information. 
model for health condition trajectory and co- Mining all such relevant information from a morbidity relationships by training the social wide variety, volume, and veracity of online health records model. Another related study user-generated content on the forums is an was performed in 2017 by Jensen et al. [1] using overlap of the technical-scientific research text analysis using EHRs to predict patient domain. It is more challenging than mining (cancer) trajectories automatically. However, standard health texts such as electronic health summarizing all the above work, we interpret a records (EHR), including hospital admission 501 gap in text information retrieval using distributed computing text retrieval through distributed clustering and classification. None clustering. This contribution is topped by a of the highlighted studies have ventured using coherent classification that is computationally this framework, which can be computationally efficient and can identify patient trajectories efficient. At the end of the proposed current based on non-clinical texts. Hence, the outcome framework, a classification model can quickly of the proposed study provides a unique and identify patient trajectories using non-clinical novel means for an individual and researchers texts from online forums. looking for cancer and COVID-19 trajectories. Frunza et al. [14] did a related study in 2011; This is done by activating relevant and in their study, they automatically extract potentially hitherto overlooked, by the sentences from clinical papers about diseases established health care systems, information and treatments. Based on the extracted hidden in non-clinical texts. sentences, semantic relations between diseases and associated treatments are then identified. 2.2. Significance Another related study was done by Rosario et al. [15] in 2004. 
The focus of their work was to In general, computational retrieval of recognize text-entities containing information information from the vast amounts of health about diseases and treatments. They use Hidden care texts is significant. Specifically, for this Markov Models and Maximum Entropy study, the significance lies in the systematic Models to perform the entity and disease- combination of state-of-the-art methods to treatment relationship recognition. mine, refine, categorize, and present Compared to Frunza et al., the later work laypersons' cancer trajectory related focuses mostly on classification. In the descriptions. It is significant to empower proposed study, the present study also focuses patients and caretakers and help build healthy on text retrieval and clustering through the patient/caretaker communities by leveraging current study. The current proposed study will the soft information not hitherto used by the also focus on cancer and COVID-19 established health care systems, e.g., trajectories, where-in the other studies have information about emotions, feelings, or only focused on cancer as a prevalent disease. personal preferences. Lastly, in the 2011 study by Yang et al. [16], Density-Based Clustering was used to identify topics within online forum threads on social 3. Proposed framework media. They also developed a visualization tool 3.1. Overview to provide an overview of the identified topics. Their tool's purpose was to extract topics with The proposed study has four major building sensitive information related to terrorism or blocks or components, including a database other criminal activities; however, it might also component for storing the cluster outputs. A be tailored to extract other topics. Besides using detailed visual representation of the framework DBSCAN, the study proposed a related is given in figure 1 below. 
It has been designed clustering method, namely SDC (Scalable in a micro-service architecture with one process Density-based clustering). The structure of the per component to make the framework light Yang et al. study is, to some extent, as the from a production implementation standpoint. present study; individually, in the present study, topics are also extracted from online forum posts, density-based clustering is also used, and Search result visualization capabilities are also Posts Informati provided. on retrieval Database from text Statistics 2.1. Research gap addressed by Clusters Clustering the novelties of the current work Tools API Classification The novelty of the proposed study is combining the state of the art un-supervised 502 Figure 1: Framework Need Help !! The left part of the framework in figure 1 Please type your query in the text box below above is the front-ending component that handles the user interaction. We will be further Search elaborating the same in section 3.2. The API component's sole purpose is to enable the front Endometrial Surgery, chemotherapy, radiation, plasma, a end component to interact with the database and with other service components. The Database Bone Surgery, chemotherapy, radiation, plasma. admission component persists all gathered forum posts Prostate Surgery, chemotherapy, radiation, plasma, admis and the computed results, e.g., clusters, classes, and cancer-trajectories. The Service component Figure 2: Front end search view handles the computationally burdensome data processing; the micro-service architecture The user interface consists of five main enables scaling of this component only. views: Search, Posts, Statistics, Clusters, and Implementing the service component as a Tools (figure 2). In the Search view, a user can scalable unit becomes well-suited for the search the entire collection of forum posts, the application of a distributed computing identified clusters. Initially, a view of the types approach. 
Especially the clustering calculations of clusters, as shown in figure 3, will be are burdensome and need to be made efficient. displayed to the end-user. By clicking a type Currently, the text retrieval and classification cluster, all posts associated with that type of calculations do not need to be scaled as they are cluster is displayed in the Posts view. Users can much faster than the clustering. browse through the posts within a type of cluster and by selecting a class-label. 3.2. Front-end Statistics Having a front end to interact with data Clusters helps to explore results from the end-user's Kidney – 160 Posts Heart - 70 Posts Liver - 150 Posts perspective. The developed user interface is useful for exploring the collected data set of forum posts and to show information from an area of interest. For instance: a user can select a Cure Side effect Disease No cure Treatment Cure Side effect Disease No cure Treatment Cure Side effect Disease No cure Treatment cluster, i.e., a disease-type, of interest, e.g., Figure 3: Cluster view COVID-19 or lymphoma cancer, and only receive posts within that cluster. A user can also choose a pivot of information, e.g., side effects 3.3. Functional validation of the of COVID-19 medicine or side effects of cancer outputs radiation, and thereby see all posts from the cancer cluster or COVID cluster that contains The robustness of a framework is considered information about side effects. Such a tool is based on the statistical metrics and needs to be relevant for scientific use and cancer patients measured on the intuitiveness of the text and caretakers. outputs as received by the users. Hence in order to concretely measure the outputs of the framework, we check the output from the below qualitative lens as well other than the statistical metrics: Functional intuitiveness: • Appropriate, • Suitable Performance: • Time, • Utilization 503 Compatibility: their word stem [17]. 
Different algorithms for • Coherence stemming exist in the literature, e.g., the Lovins Scalability: Stemmer [18], the Paice Stemmer [19], and the • Modular structure, predominant Porter Stemmer [20]. All these • Easily modifiable stemming algorithms are best suited for Portability: English; in the present study, the Porter • Easy installation Stemmer is used. The Porter Stemming algorithm is based on five steps, and in each step, a specified set of rules is applied to the word being processed. For instance, the first 4. Information retrieval step contains the following processing rules, as 4.1. Data collection represented in figure 4. In the tokenization part, character and word sequences are sliced into The data-set has been created by collecting tokens. Typically, the tokens are words or the texts from online posts on the medical terms, but in this study, tokens are only words. forums non-clinical. The posts are mostly After the tokenization, stop words are removed. written by person in-general and not doctors or medical staff. Hence the topics and words used 4.3. Information retrieval are more day-to-day life and less skewed towards specific medical terminologies. To make sure that the clustering of posts into Typically, the data collected from posts will a specific type of disease clusters to be accurate, consist of symptoms, initial experiences, information from all the collected posts' content treatments, place where treated, post-treatment attributes must be extracted. This is achieved by experience, questions, side-effects, and using various natural language processing [9] outcomes. The most informative and and text retrieval, together with a predefined unstructured data is stored in the actual text of feature vector containing names of a range of each row. This text's basis, the information disease types. For current work, we use the term retrieval framework proposed in the current weighting approach. 
This approach uses term study, extracts the relevant features for frequency and inverse document frequency to clustering. Often, these non-clinical texts yield term frequency-inverse document captured contain rather detailed descriptions of frequency, which is the term's final weight. The a disease (like cancer or COVID-19) and the purpose of term frequency (tf) is to measure specific treatment received. how often a term occurs in a specific text corpus, i.e., in this study, tf is simply an 4.2. Data preprocessing unadjusted count of term appearances. Term frequency [21] can be defined as To ensure that the actual text information tf(t,d) | occurrences of term t in document d. retrieval works successfully, the collected text Documents vary in length, which entails a bias needs to be preprocessed and cleansed for any in tf; that is, a term is likely to appear more noise in the data. For the proposed research, we often in a lengthy document than in a short have conducted three preprocessing steps: document, given the documents are similar in content [22]. Whenever a term is frequent in a 1. Cleansing, document, it is likely to be relevant to that 2. Stemming, and specific document. The purpose of inverse 3. Tokenization document frequency (IDF) is to measure the weight of a term in a collection of documents; a The first step of cleansing consists of rare term is often more valuable than a common processes to remove unwanted characters, e.g., term in a collection of documents [23]. HTML tags, emojis, and ASCII-artworks. This Term frequency-inverse document is a non-trivial task when dealing with forum frequency (tf-IDF) is a measure of how posts as people express themselves quite important a word is to a specific document in a informally. In the second step of the stemming collection of documents. A significant tf-idf part, inflected and derived words are reduced to weight is obtained whenever: 1. the term 504 frequency is high for the specific document, and 2. 
the document frequency is low for the term 2. If (p; q) is in C, then p is density-connected across the collection of documents. Combining to q. the tf and IDF weights tends to filter out To create a cluster, the DBSCAN algorithm standard terms that do not carry much initiated an arbitrary point p and searched for all information [24], [25]. the points in the density range of p with respect to Ɛ and mpts. If p is a core point, then a new cluster with p as a core point is created. If p is a 5. Clustering border point, DBSCAN browses the next point 5.1. Existing DBSCAN clustering in the sample. DBSCAN can also merge any two clusters into one of these clusters are in the Clustering is a process of grouping same density range. The algorithm will unlabeled data into clusters of homogenous converge when no new points can be added to attributes. The data points in each cluster have any existing or new clusters. similar traits, such that the variance within- cluster is minimum, and variance across 5.2. MapReduce DBSCAN clusters is maximum. In the proposed study, a cluster would represent a homogenous group of clustering similar texts from posts. Density-Based Spatial Clustering of Applications with Noise The entire process of DBSCAN clustering is (DBSCAN) is a clustering algorithm based on computationally costly with high time and data points' density (also known as memory consumption. To reduce this observations). DBSCAN helps to create consumption and increase efficiency, clusters with a high density of data points, and MapReduce DBSCAN was proposed. The only in doing so, it allows clusters of any shape even difference between a regular DBSCAN if it contains noise, which is slightly different in clustering and DBSCAN via MapReduce is approach compared to conventional clustering through distribution computation. The steps algorithms. followed in a MapReduce DBSCAN can be DBSCAN can now find clusters of shown in figure 4 below. 
different sizes and skip the input of taking the number of clusters beforehand. In DBSCAN, Data the Ɛ-neighborhood of point p will be defined Database mapped by the points within a radius Ɛ of p. If a point to cluster p's Ɛ-neighborhood contains at least mpts number of points, the point p is called a core Partition Merging point. A data point is called noise if it is not a core point. A point p is in the density-range from a point q if p is within the Ɛ-neighborhood DBSCAN Map the profile of q, and q is a core point. Figure 4: MapReduce DBSCAN A point p is defined as in the density range from a point q with regard to Ɛ and mpts if there is a 5.3. Partition in MapReduce chain of points, p1,…..,pn, where p1 = q and pn = p such that pi+1 is in direct density range DBSCAN clustering from pi. A point p is defined as a density- connected point to another point q with regards To maximize runtime efficiency through to Ɛ and mpts if only there is a point o such that invoking, parallel processing can be achieved if both p and q are density-reachable from o. A the data is well balanced. If the data is well point p is a border point if p's Ɛ neighborhood balanced, then the computational load can be contains less than mpts, and p is in direct density evenly distributed on computer nodes' from a core point. A cluster C is a non-empty execution. In real-life text data, it is usually un- set that satisfies the following two conditions balanced, and the best strategy to deal with this for all point pairs (p;q): is using data portioning. This is an inherent part 1. If p is in C and q is density-reachable from of MapReduce DBSCAN. p, then q is also in C; and 505 Recursive split is the best and frequently partitions are 1. Execute a nested loop on all used data partitioning method, which helps split points in the collected merge candidate lists to the entire bigger data-set into smaller subsets. 
see if the same data points exist with different This is done recursively till a stop criterion is local cluster IDs; 2. If found, then merge the met. All partitions then contain less than a given clusters. number of points, or a given number of Figure 5 illustrates two examples of cluster- partitions have been made. Logically, a merge propositions. Example 1: the points d1 partition cannot be smaller than 2Ɛ; when a belong to C1, and d2 belong to C2 are core partition is split, the geometry must remain points, and d2 is directly density-reachable from extended beyond 2Ɛ. When splitting a partition d1; thus, C1 should merge with C2. Example 2: into two in MapReduce DBSCAN, all possible The point d3 belongs to C1 is a core point, and r splits are considered. The split that minimizes belongs to C2 is a border point; thus, C1 should the loss in one of the sub-partitions is chosen. not merge with C2. Here, the loss is calculated as the difference Mapping Profile step where the purpose is to between the number of points in sub-partition- create a profile that maps clusters that should be 1 and half of the number of points in sub- merged. The algorithm for generating the partition-2. Each partition is given a key and mapping profile is represented in the algorithm associated with a reducer. in figure 5. The output of the algorithm is a list of pairs of local clusters to be merged (denoted MP) and a list of border points (denoted BP); a 5.4. Local DBSCAN point p is at least a border point in a merged cluster (this is taken care of in the next step). Continuing the definition of reducer from the previous paragraph, each reducer will be given a partition and all its associated data points, and hence a mapper should prepare all 1. for each cp in CP do 2. for each bp in BP do data related to a partition. Explaining the same 3. if cp.id == bp.id then concept using an example: the data assigned 4. MP.add ((cp.local cluster id), would be the related data Ci within Pi, and the 5. 
(bp.local cluster id)) 6. BP.delete(bp) data within Pi's Ɛ-width extended partition Ri 7. end if that overlap the bordering partitions. 8. end for Local DBSCAN borrows the working 9. end for principles from the original DBSCAN to Figure 5: Merge mapping perform the clustering. It starts with an arbitrary data point p belonging to Ci and searches for 5.6. Merge points in the density of p with respect to Ɛ and mpts. If p is a core point, the Ɛ neighborhood will be explored for data points. If Local DBSCAN The previous step resulted in a list of pairs finds a point in the outer margin directly in the of clusters to be merged. The IDs of the local density range from a point in the inner margin, clusters should be changed into a unique global it is added to the merge-candidate set. If a core ID after merging. Thus, a global perspective of point is in the inner margin, it is also added to all local clusters is built (algorithm in figure 6). the merge-candidate set. Each point in the Lastly, as mentioned in the previous step, noise cluster is given a local cluster-id generated and points are set to border points. mapped from partition id and the label id from the local clustering. 5.5. Mapping profile After each partition has undergone clustering and merge candidate lists have been generated, the merge candidate lists are collected to a single merge candidate list. The basics of merging the clusters from the different 506 for each element pair ei , ej ɛ MP; i≠j; do 1. 2. if ei , ej Ɛ L then Table 1 3. put ei and ej into the same Map Slot in L Results 4. end if 5. if ei ɛ L ˄ ej ɛ L then Class label Class 6. put ej into ei's Map Slot in L description 7. end if with 8. if ei , ej Ɛ L then 9. if ei and ej are not in the same Map Slot in L, example then move the Map Slot with the highest index to the Map Slot with the lowest index posts in 10. end if italics 11. end for 12. return L Cure About Figure 6: Global ID map cancer- curing treatments. 6. 
Classification After 16 chemo The result of the clustering is a set of sessions, my specific disease type clusters. To enable further cancer was filtering possibilities for the end-user, a within- gone. cluster classification is conducted such that No cure About each post within a disease type cluster is labeled with one of the six labels illustrated in table 1. cancer non- This allows an end-user to filter the forum posts curing such that, for instance, only posts with specific treatments. disease (cluster) treatments (class) are shown. My husband We have chosen to classify with a Naive went Bayes classifier trained with a manually created through training set augmented with the freely available chemo since set from the BioText Project, UC, Berkeley he had [26]. The Frunza et al. study also uses a Naive bladder Bayes classifier with promising results [9]. cancer. However, they classified abstracts from Sadly, he scientific articles, which is a somewhat passed. different data-domain than the present study's non-clinical texts. The time complexity for training a Naive Bayes classifier is O(np), where n is the number of training observations, 6.1. Clustering and p is the number of features; thus, disregarding the constant, the complexity is in MapReduce DBSCAN is a distributed terms of observations O(n). When testing, extension of DBSCAN, and they use the same Naive Bayes is also linear, which is optimal for principle for clustering. Thus, given the same a classifier. input, the two clustering methods should yield the same output. The results in this section show that this is indeed the case, and we thereby consider the implementations of MR-DBSCAN and DBSCAN to be verified in terms of the correctness of the logical output. The actual implementations do not share code, so it seems fair to disregard the odd risk of having both implementations wrong in a manner that lead to the same output. 
To compare the clustering results of DBSCAN and MR-DBSCAN, the Adjusted Rand Index (ARI) [30] is used. The index is a similarity measure between two clusterings. It is obtained by counting the pairs of points on which the two clusterings agree (both place the pair in the same cluster, or both place it in different clusters) against the pairs on which they disagree, with a correction for chance agreement. If the label assignments coincide fully, the index is 1; if they agree no better than chance, the index is close to 0. If DBSCAN and MR-DBSCAN are implemented correctly, the ARI must be 1 regardless of:
1. the number of points in the data set,
2. the number of partitions in MR-DBSCAN, and
3. the parameter settings for Ɛ and mpts.
In addition, the number of partitions (#P) in MapReduce DBSCAN, the coverage percentage (%C), and the number of labels (#L) in DBSCAN and MapReduce DBSCAN have been recorded. The results (table 2) show that the ARI is 1 in all 18 test cases. A necessary condition for this is that MR-DBSCAN and DBSCAN yield the same number of labels in every test, which is also the case (table 2). Moreover, MR-DBSCAN partitioned its data into 3-8 partitions (table 2), which means that even though the data has been split and clustered individually per partition, the merging works as intended and yields the same clustering as DBSCAN. The coverage percentage is also identical for the two clusterings in all test cases.

Table 2
Adjusted Rand Index of clustering

Posts   Ɛ       mpts   DBSCAN          MapReduce DBSCAN        ARI
                       #L     %C       #L     %C      #P
25000   10^-3   5      10     3.34     10     3.34    8        1
25000   10^-3   50     2      2.99     2      2.99    8        1
25000   10^-3   100    1      2.66     1      2.66    8        1
25000   10^-2   5      10     3.34     10     3.34    7        1
25000   10^-2   50     2      2.99     2      2.99    7        1
25000   10^-2   100    1      2.66     1      2.66    7        1
25000   10^-1   5      11     3.37     11     3.37    3        1
25000   10^-1   50     2      2.99     2      2.99    3        1
25000   10^-1   100    1      2.66     1      2.66    3        1
35000   10^-3   5      23     2.92     23     2.92    7        1
35000   10^-3   50     2      2.37     2      2.37    7        1
35000   10^-3   100    1      2.02     1      2.02    7        1
35000   10^-2   5      23     2.92     23     2.92    6        1
35000   10^-2   50     2      2.37     2      2.37    6        1

6.2. Real-time analysis of MapReduce DBSCAN

The motivation behind this experiment is to demonstrate the real-time applicability of each of the MapReduce DBSCAN steps under variations in
1. the number of forum posts, and
2. the neighborhood radius Ɛ.
These two parameters have the most significant influence on MapReduce DBSCAN's runtime. The Ɛ parameter is used when partitioning the data set, and therefore it directly influences the beneficial effects of MapReduce. In all tests, the lower point-count threshold for establishing a core point, mpts, is fixed to 5 points. This is done because the parameter has very little influence on runtime, and this influence is isolated to the DBSCAN step, i.e., it does not highlight runtime differences between DBSCAN and MapReduce DBSCAN.

For all 30 test cases (table 3), mapping takes almost no time, and merging also has only a small effect on runtime. For values of Ɛ that are large relative to the data span, i.e., 1 and 0.1, MapReduce DBSCAN cannot partition the data set well. This affects the runtime, as the clustering is then performed on a single partition (or very few), and no MapReduce improvements are achieved. For relatively small values of Ɛ, i.e., 0.001 and 0.0005, the data set is split well into partitions, but due to the low value of Ɛ there are many possible partitions, and much time is spent searching for the best partitioning. Thus, as the results show, the partitioning becomes slower when Ɛ decreases, but the local DBSCAN becomes faster. Hence, Ɛ needs to be set with care to strike a balance and minimize the total runtime of MapReduce DBSCAN. In our experiments, the balance is at Ɛ = 0.01; here, the partitioning runtime is relatively low, and likewise for the local DBSCAN, which results in a relatively low total runtime.

6.3. Validation of clustering

The purpose of this experiment is to compare runtime as a function of the number of forum posts for three clustering algorithms: DBSCAN, MapReduce DBSCAN, and Hierarchical Density Estimates DBSCAN. Algorithm parameters are fixed and equal across the tests in order not to bias the results. Specifically, the lower point-count threshold for establishing a core point is mpts = 50 and the neighborhood radius is Ɛ = 0.01 for all tests. Note that the setting Ɛ = 0.01 was previously found (section 6.2) to be a suitable choice for MapReduce DBSCAN. The data sets in this experiment are various subsets of the collected forum posts; the number of tf-idf features has been limited to 1000. The results of all tests are reported in table 3 and figure 7.

Table 3
Results of various clustering

Posts   MapReduce DBSCAN [s]   DBSCAN [s]   Hierarchical Density Estimates DBSCAN [s]
10000   4.755                  11.969       11.696
20000   19.545                 21.167       48.731
30000   31.115                 50.237       105.007
40000   37.321                 92.392       217.033

Figure 7: Comparison of clustering DBSCAN (blue), MapReduce DBSCAN (red), and Hierarchical Density Estimates DBSCAN (green). Horizontal axis: Posts; vertical axis: Seconds (s).

7. Discussion

The primary motivation of the proposed work is that the clinical or medical information contained in non-clinical posts collected from forums is valuable and worth making available to others in a more structured form. In the proposed study, this is achieved by a decision support system that can act as a source of information to help patients with diseases such as COVID-19 or cancer, together with their caretakers and families, learn about disease trajectories, initial symptoms, diagnosis outcomes, sources, treatment centers, treatments taken, after-effects of treatment, and costs.

The information retrieval framework processes the non-clinical posts on forums using text retrieval, unsupervised clustering, and a classification model. The framework is designed to execute on a distributed computing set-up like MapReduce to increase computational efficiency. As a result, the response time of computationally costly clustering on texts improves considerably, which is needed for a real-time application.

Moreover, the endpoint of the current framework is a user interface that enables the end-user to interact with the database and mine valuable information to understand the overall trajectory of any disease. This helps patients be better informed before they consult a doctor. The framework can also mobilize online social communities of patients, their caretakers, and families using soft information from non-clinical, hitherto idle conversations.

The proposed framework contributes to the existing literature in several ways. Adding, refining, and benchmarking more clustering and classification methods would yield more comprehensive information from non-clinical texts, which might lead to better results, i.e., more accurate clustering and classification, and thus, ultimately, a better end-user service. For the classification, it would mainly be of interest to collect and use a more extensive training set. The response time of DBSCAN and Hierarchical Density Estimates DBSCAN clustering has been improved by redesigning the algorithms to guarantee upper bounds on memory consumption. This can act as a reference in the literature for future researchers.

Lastly, the proposed system and framework is easily generalizable: it can readily be applied in domains besides COVID-19 or cancer by loading new data sets and associated feature vectors.
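As an illustration of the Adjusted Rand Index check used to validate MR-DBSCAN against DBSCAN, the following is a minimal sketch in plain Python. It is not the implementation used in the study. The label vectors are toy data, the noise label -1 is an assumed convention, and the helper names `coverage_percentage` and `number_of_labels` are hypothetical; they encode one plausible reading of %C (share of points assigned to any cluster) and #L (number of distinct cluster labels).

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two label assignments of the same points."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same points")
    n = len(labels_a)
    # Contingency counts: points carrying label i in A and label j in B.
    cell_counts = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    total_pairs = comb(n, 2)
    # Pair agreement expected by chance, and the maximum attainable value.
    expected = sum_a * sum_b / total_pairs if total_pairs else 0.0
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case (e.g. a single cluster in both): perfect agreement.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def coverage_percentage(labels, noise_label=-1):
    """Assumed meaning of %C: percentage of points assigned to any cluster."""
    clustered = sum(1 for label in labels if label != noise_label)
    return 100.0 * clustered / len(labels)


def number_of_labels(labels, noise_label=-1):
    """Assumed meaning of #L: distinct cluster labels, noise excluded."""
    return len(set(labels) - {noise_label})


# Toy label vectors standing in for DBSCAN and MR-DBSCAN output.
dbscan_labels = [0, 0, 1, 1, -1, 2, 2, -1]
# Same grouping with different cluster ids, as may happen after merging.
mr_dbscan_labels = [5, 5, 3, 3, -1, 9, 9, -1]

ari = adjusted_rand_index(dbscan_labels, mr_dbscan_labels)
print("ARI:", ari)
print("#L:", number_of_labels(dbscan_labels), number_of_labels(mr_dbscan_labels))
print("%C:", coverage_percentage(dbscan_labels), coverage_percentage(mr_dbscan_labels))
```

Because the index compares groupings rather than label values, renumbered cluster ids do not lower the score. In this sketch noise points are treated as one additional group, which is a simplifying choice.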
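The influence of Ɛ on partitioning described in section 6.2 can be illustrated with a toy example. The sketch below assumes a uniform grid with cells of side 2Ɛ over two-dimensional points; this is an illustrative stand-in, not the partitioning strategy of MapReduce DBSCAN, and the function name `grid_partitions` and the random points are hypothetical. It only shows why a large Ɛ collapses the data into a single partition while a small Ɛ produces many candidate partitions.

```python
import random
from collections import defaultdict


def grid_partitions(points, eps):
    """Group 2-D points into square cells of side 2 * eps (illustrative choice)."""
    side = 2.0 * eps
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // side), int(y // side))].append((x, y))
    return cells


random.seed(0)
# Hypothetical stand-in for post features: 2000 points in the unit square.
points = [(random.random(), random.random()) for _ in range(2000)]

for eps in (1.0, 0.1, 0.01, 0.001):
    cells = grid_partitions(points, eps)
    largest = max(len(members) for members in cells.values())
    print(f"eps={eps:<6} cells={len(cells):>5} largest_cell={largest}")
```

With Ɛ = 1 the whole unit square falls into one cell, so any local clustering would run on the full data set. As Ɛ shrinks, the number of occupied cells grows and each cell holds fewer points, which mirrors the trade-off between partitioning cost and local DBSCAN cost.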
8. References

[1] K. Jensen et al., "Analysis of free text in electronic health records for identification of cancer patient trajectories," Scientific Reports, vol. 7, no. 1, art. no. 1, Apr. 2017, doi: 10.1038/srep46226.
[2] S. A. Murray, M. Kendall, K. Boyd, and A. Sheikh, "Illness trajectories and palliative care," BMJ, vol. 330, no. 7498, pp. 1007–1011, Apr. 2005, doi: 10.1136/bmj.330.7498.1007.
[3] "The Danish Cancer Society," International. https://www.cancer.dk/international/about-the-danish-cancer-society/ (accessed 09th August, 2020).
[4] "WHO Coronavirus Disease (COVID-19) Dashboard." https://covid19.who.int (accessed 09th August, 2020).
[5] G. Umefjord, K. Hamberg, H. Malker, and G. Petersson, "The use of an Internet-based Ask the Doctor Service involving family physicians: evaluation by a web survey," Fam Pract, vol. 23, no. 2, pp. 159–166, Apr. 2006, doi: 10.1093/fampra/cmi117.
[6] G. Umefjord, H. Sandström, H. Malker, and G. Petersson, "Medical text-based consultations on the Internet: A 4-year study," International Journal of Medical Informatics, vol. 77, no. 2, pp. 114–121, Feb. 2008, doi: 10.1016/j.ijmedinf.2007.01.009.
[7] A. K. Kushwaha and A. K. Kar, "Language Model-Driven Chatbot for Business to Address Marketing and Selection of Products," in Re-imagining Diffusion and Adoption of Information Technology and Systems: A Continuing Conversation, Cham, 2020, pp. 16–28, doi: 10.1007/978-3-030-64849-7_3.
[8] A. K. Kushwaha and A. K. Kar, "Micro-foundations of Artificial Intelligence Adoption in Business: Making the Shift," in Re-imagining Diffusion and Adoption of Information Technology and Systems: A Continuing Conversation, Cham, 2020, pp. 249–260, doi: 10.1007/978-3-030-64849-7_22.
[9] A. K. Kushwaha, A. K. Kar, and P. Vigneswara Ilavarasan, "Predicting Information Diffusion on Twitter a Deep Learning Neural Network Model Using Custom Weighted Word Features," in Responsible Design, Implementation and Use of Information and Communication Technology, Cham, 2020, pp. 456–468, doi: 10.1007/978-3-030-44999-5_38.
[10] A. K. Kushwaha, S. Mandal, R. Pharswan, A. K. Kar, and P. V. Ilavarasan, "Studying Online Political Behaviours as Rituals: A Study of Social Media Behaviour Regarding the CAA," in Re-imagining Diffusion and Adoption of Information Technology and Systems: A Continuing Conversation, Cham, 2020, pp. 315–326, doi: 10.1007/978-3-030-64861-9_28.
[11] S. Ebadollahi, J. Sun, D. Gotz, J. Hu, D. Sow, and C. Neti, "Predicting Patient's Trajectory of Physiological Data using Temporal Trends in Similar Patients: A System for Near-Term Prognostics," Amia Annual Symposium, vol. 2010, pp. 192–196, 2010.
[12] "Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients | Nature Communications." https://www.nature.com/articles/ncomms5022 (accessed 09th August, 2020).
[13] X. Ji, S. A. Chun, and J. Geller, "Predicting Comorbid Conditions and Trajectories Using Social Health Records," IEEE Transactions on NanoBioscience, vol. 15, no. 4, pp. 371–379, Jun. 2016, doi: 10.1109/TNB.2016.2564299.
[14] O. Frunza, D. Inkpen, and T. Tran, "A Machine Learning Approach for Identifying Disease-Treatment Relations in Short Texts," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 6, pp. 801–814, Jun. 2011, doi: 10.1109/TKDE.2010.152.
[15] C. Lousteau-Cazalet et al., "A decision support system for eco-efficient biorefinery process comparison using a semantic approach," Computers and Electronics in Agriculture, vol. 127, pp. 351–367, Sep. 2016, doi: 10.1016/j.compag.2016.06.020.
[16] C. C. Yang and T. D. Ng, "Analyzing and Visualizing Web Opinion Development and Social Interactions With Density-Based Clustering," IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, vol. 41, no. 6, pp. 1144–1155, Nov. 2011, doi: 10.1109/TSMCA.2011.2113334.
[17] C. Manning, P. Raghavan, and H. Schuetze, "Introduction to Information Retrieval," p. 581, 2009.
[18] An algorithm for suffix stripping. 1980.
[19] J. A. Goldsmith, D. Higgins, and S. Soglasnova, "Automatic Language-Specific Stemming in Information Retrieval," in Cross-Language Information Retrieval and Evaluation, Berlin, Heidelberg, 2001, pp. 273–283, doi: 10.1007/3-540-44645-1_27.
[20] C. H. Porter, L. E. Lynch, J. A. Herrig, and R. J. Ziebol, "(54) DEVICE AND METHOD FOR VASCULAR ACCESS," p. 60.
[21] S. E. Robertson and K. Spärck Jones, "Simple, proven approaches to text retrieval," University of Cambridge, Computer Laboratory, UCAM-CL-TR-356, 1994. Accessed: 10th August, 2020. [Online]. Available: https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-356.html.
[22] S. E. Robertson and K. S. Jones, "Relevance weighting of search terms," Journal of the American Society for Information Science, vol. 27, no. 3, pp. 129–146, 1976, doi: 10.1002/asi.4630270302.
[23] S. Robertson, "Understanding inverse document frequency: on theoretical arguments for IDF," Journal of Documentation, vol. 60, no. 5, pp. 503–520, Jan. 2004, doi: 10.1108/00220410410560582.
[24] Kar, Arpan, "Applications of Machine Learning in Business," Business Frontiers, 24th July, 2020.
[25] A. Kar, "Understanding Machine Learning and Artificial Intelligence and their effects on Financial Systems – Business Fundas."
[26] "Classification of Diseases and their Treatments Using Machine Learning Approach - ProQuest." https://search.proquest.com/openview/423cca63369eb17808ce3e845e51b852/1?cbl=2029261&pq-origsite=gscholar (accessed 12th August, 2020).