
Information Labelling of Medical Forum Posts by Non-Clinical
Text Information Retrieval
Amit Kumar Kushwaha, Arpan Kumar Kar
Indian Institute of Technology Delhi, New Delhi, India


                Abstract
                With the advent of Web 2.0, modern societies produce vast amounts of data. Merely keeping
                up with storage and transmission is difficult, and analyzing the data to extract useful
                information is harder still. Historical research in healthcare data processing has
                concentrated on formal clinical data, yet non-clinical information also holds a lot of
                valuable data that lies idle. The proposed study combines state-of-the-art methods from
                distributed computing, text retrieval, clustering, and classification into a
                computationally efficient system that can clarify cancer patient trajectories based on
                non-clinical, freely available online forum posts. The motivation is that informed
                patients, caretakers, and relatives often achieve better overall treatment outcomes thanks
                to better possibilities for proper disease management. The resulting software prototype is
                fully functional and built to serve as a test bench for various text information retrieval
                and visualization methods. Via the prototype, we demonstrate computationally efficient
                clustering of posts into cancer types and subsequent within-cluster classification into
                trajectory-related classes. The system also provides an interactive graphical user
                interface that allows end-users to mine and oversee the valuable information.

                Keywords
                Machine Learning, Chatbot, Artificial Intelligence, Medical, Ontology


1. Introduction

    Most patients who acquire a progressive, terminal disease are towards the
latter end of life. These illnesses are primarily respiratory disorders,
cancers, and cardiovascular diseases. Such an illness implies a long time frame
for the patients themselves and for the surrounding relatives and caretakers
[1], [2]. The trajectory over this time frame can be summarized as a sequence
of steps shown in figure 1. Although the trajectory looks simple and compact,
it is complex and contains a range of concerns underneath each of the four
steps, for instance: life expectancy at each stage, patterns of decline,
probable interactions with other health services, medicinal side effects,
treatment plans and costs at each stage, any other non-documented side effects,
palliative care, and many more.
    Misinformation can lead to costlier yet unsuccessful and delayed treatment.
Well-founded information, in turn, can clarify the trajectory over the time
frame and lead to better overall treatment owing to better clinical sources and
decisions. This in turn makes fewer re-admissions, decreased health care costs,
and a higher quality of life more likely for patients in their potentially
final weeks, months, and years. Ultimately, better overall care is obtainable
via early clarification, estimation, and communication of patient-specific
symptoms and disease trajectories.

ISIC’21: International Semantic Intelligence Conference,
February 25-27, 2021, Delhi, India
EMAIL:Kushwaha.amitkumar@gmail.com;
Arpan_kar@yahoo.co.in
ORCID: 0000-0002-5537-1250; 0000-0003-4186-4887
            ©️ 2020 Copyright for this paper by its authors. Use permitted under Creative
            Commons License Attribution 4.0 International (CC BY 4.0).

            CEUR Workshop Proceedings (CEUR-WS.org)
    The proposed study is motivated by the idea of exploiting the relevant yet
idle information in the ever-increasing user-generated content available as
freely accessible, non-clinical online text, for the benefit of anyone
interested in a clinical trajectory, e.g., cancer patients or COVID-19
patients. With the recent rise in COVID-19 cases, there was massive unrest
during the initial stages of the disease spread, when even patients who had
tested positive were unsure about the trajectory through the sequence of steps
in figure 1. This motivated us to further this study, which can be an essential
literature contribution for researchers and have day-to-day practical
implications for someone who has internet access but cannot navigate through
large amounts of unstructured content to find simple, relevant clinical
information.
    Historical data shows that approximately one-third of the entire world's
population gets diagnosed with cancer during their lifetime [3]. According to
the World Health Organization (WHO) [4], as of 9 August 2020, 20 million
patients globally had tested positive for COVID-19. Thus, there is a large
community of potential end-users who can consume non-clinical data to answer
their queries related to clinical trajectories. A diagnosis of cancer or, in
recent times, COVID-19 leads to several reactions; a predominant one is to
first seek information online on the specific symptoms, type, and severity, and
finally on the trajectory prognosis.
    A trend that has recently gained prominence among the community is to
communicate on online forums [5], [6], [7], [8], [9], [10]. On these medical
forums, people can write freely about their emotions and what they feel about
the disease, treatment, after-effects, and normalcy after the treatment without
disclosing their identity. For instance, on cancer forums, people write freely
about their initial-stage frustrations and fears and how they overcame them.
The same applies to COVID-19 forums. No healthcare system leverages this freely
available, non-clinical, yet potentially very relevant information.
    Mining all such relevant information from the wide variety, volume, and
veracity of online user-generated forum content lies at an overlap of technical
and scientific research domains. It is more challenging than mining standard
health texts such as electronic health records (EHR), including hospital
admission journals that capture doctors' comments, medical reports, and
discharge summaries. In all these formal EHRs, the language of causes,
symptoms, cures, and after-effects is more concise and specific, and medical
terms are used more distinctly from case to case. These terms are very
different from a layperson's wording in the same context on online forums. This
adds to our motivation to make this non-clinical data available to the general
public.
    The current research objective is to clarify and communicate the patient
trajectories at each stage by computationally efficient text information
retrieval from non-clinical online forum posts. The study meets this objective
by building a fully functional and generalizable framework that can
screen/filter, process, and present the non-clinical data on clinical
trajectories in a visual and informative way. The framework is designed to act
as a test bench for future text information retrieval methods and is not
restricted to the current study. The current study's underlying premise is that
unstructured, valuable information is freely available on non-clinical yet
medical forums.

2. Related work

    Scholarly research with objectives, methods, and hypotheses rooted in data
mining has mostly focused on text summarization. In 2005, Murray et al. [2]
performed a clinical review that summarizes three disease trajectories: organ
failure (heart and lung), frail elderly, and cancer. In a related 2010 study,
Ebadollahi et al. [11] predicted a patient's trajectory from temporal
physiological data. This was further improved in 2014 by Jensen et al. [12]
with disease trajectory data spanning fifteen years from a large patient
population.
    In 2016, Ji et al. [13] proposed a predictive model for health condition
trajectories and co-morbidity relationships, trained on social health records.
Another related study was performed in 2017 by Jensen et al. [1], using text
analysis of EHRs to predict (cancer) patient trajectories automatically.
However, summarizing all the above work, we identify a
gap in text information retrieval using distributed clustering and
classification. None of the highlighted studies has ventured into using such a
framework, which can be computationally efficient. At the end of the proposed
framework, a classification model can quickly identify patient trajectories
using non-clinical texts from online forums.
    Frunza et al. [14] did a related study in 2011, in which they automatically
extract sentences about diseases and treatments from clinical papers. Based on
the extracted sentences, semantic relations between diseases and associated
treatments are then identified. Another related study was done by Rosario et
al. [15] in 2004. The focus of their work was to recognize text entities
containing information about diseases and treatments. They use Hidden Markov
Models and Maximum Entropy Models to perform the entity and disease-treatment
relationship recognition.
    Compared to Frunza et al., the latter work focuses mostly on
classification. The present study also focuses on text retrieval and
clustering. It covers both cancer and COVID-19 trajectories, whereas the other
studies have focused only on cancer as a prevalent disease.
    Lastly, in the 2011 study by Yang et al. [16], density-based clustering was
used to identify topics within online forum threads on social media. They also
developed a visualization tool to provide an overview of the identified topics.
Their tool's purpose was to extract topics with sensitive information related
to terrorism or other criminal activities; however, it might also be tailored
to extract other topics. Besides using DBSCAN, the study proposed a related
clustering method, namely SDC (Scalable Density-based Clustering). The
structure of the Yang et al. study is, to some extent, similar to that of the
present study: here, too, topics are extracted from online forum posts,
density-based clustering is used, and result visualization capabilities are
provided.

2.1. Research gap addressed by the novelties of the current work

    The novelty of the proposed study is combining state-of-the-art
unsupervised distributed-computing text retrieval through clustering. This
contribution is topped by a coherent classification that is computationally
efficient and can identify patient trajectories based on non-clinical texts.
Hence, the outcome of the proposed study provides a unique and novel means for
individuals and researchers looking for cancer and COVID-19 trajectories. This
is done by activating relevant information hidden in non-clinical texts that
has potentially been overlooked hitherto by the established health care
systems.

2.2. Significance

    In general, computational retrieval of information from the vast amounts of
health care texts is significant. Specifically, for this study, the
significance lies in the systematic combination of state-of-the-art methods to
mine, refine, categorize, and present laypersons' cancer-trajectory-related
descriptions. It is significant to empower patients and caretakers and to help
build healthy patient/caretaker communities by leveraging the soft information
not hitherto used by the established health care systems, e.g., information
about emotions, feelings, or personal preferences.

3. Proposed framework
3.1. Overview

    The proposed framework has four major building blocks or components,
including a database component for storing the cluster outputs. A detailed
visual representation of the framework is given in figure 1 below. It has been
designed in a micro-service architecture with one process per component to make
the framework light from a production implementation standpoint.

[Figure 1 diagram: front-end views (Search, Posts, Statistics, Clusters, Tools),
an API, a Database, and services for information retrieval from text,
clustering, and classification.]
Figure 1: Framework
    The left part of the framework in figure 1 above is the front-end
component that handles the user interaction; section 3.2 elaborates on it. The
API component's sole purpose is to enable the front-end component to interact
with the database and with the other service components. The Database
component persists all gathered forum posts and the computed results, e.g.,
clusters, classes, and cancer trajectories. The Service component handles the
computationally burdensome data processing; the micro-service architecture
makes it possible to scale this component alone. Implementing the service
component as a scalable unit makes it well suited to a distributed computing
approach. The clustering calculations in particular are burdensome and need to
be made efficient. Currently, the text retrieval and classification
calculations do not need to be scaled, as they are much faster than the
clustering.

3.2. Front-end

    Having a front end to interact with the data helps to explore results from
the end-user's perspective. The developed user interface is useful for
exploring the collected data set of forum posts and for showing information
from an area of interest. For instance, a user can select a cluster of
interest, i.e., a disease type such as COVID-19 or lymphoma, and receive only
posts within that cluster. A user can also choose a pivot of information, e.g.,
side effects of COVID-19 medicine or side effects of cancer radiation, and
thereby see all posts from the cancer cluster or COVID-19 cluster that contain
information about side effects. Such a tool is relevant for scientific use as
well as for cancer patients and caretakers.

[Figure 2 screenshot: a "Need Help !!" search page with the prompt "Please type
your query in the text box below", a Search button, and result rows for
Endometrial, Bone, and Prostate, each listing terms such as surgery,
chemotherapy, radiation, plasma, and admission.]
Figure 2: Front end search view

    The user interface consists of five main views: Search, Posts, Statistics,
Clusters, and Tools (figure 2). In the Search view, a user can search the
entire collection of forum posts and the identified clusters. Initially, a view
of the types of clusters, as shown in figure 3, is displayed to the end-user.
When the user clicks a cluster type, all posts associated with that type of
cluster are displayed in the Posts view. Users can browse through the posts
within a type of cluster and narrow them down by selecting a class label.

[Figure 3 screenshot: the Statistics/Clusters view with three clusters, Kidney
(160 posts), Heart (70 posts), and Liver (150 posts), each broken down into the
classes Cure, Side effect, Disease, No cure, and Treatment.]
Figure 3: Cluster view

3.3. Functional validation of the outputs

    The robustness of a framework is usually judged on statistical metrics, but
it also needs to be measured on how intuitive the text outputs are to the users
who receive them. Hence, to measure the outputs of the framework concretely, we
check them through the following qualitative lenses in addition to the
statistical metrics:
Functional intuitiveness:
        • Appropriate,
        • Suitable
Performance:
        • Time,
        • Utilization
Compatibility:
        • Coherence
Scalability:
        • Modular structure,
        • Easily modifiable
Portability:
        • Easy installation

4. Information retrieval
4.1. Data collection

    The data set was created by collecting texts from online posts on
non-clinical medical forums. The posts are mostly written by the general public
and not by doctors or medical staff. Hence, the topics and words are closer to
day-to-day language and less skewed towards specific medical terminology.
Typically, the data collected from posts consists of symptoms, initial
experiences, treatments, place of treatment, post-treatment experience,
questions, side effects, and outcomes. The most informative, unstructured data
is stored in the actual text of each row. On the basis of this text, the
information retrieval framework proposed in the current study extracts the
relevant features for clustering. Often, these non-clinical texts contain
rather detailed descriptions of a disease (like cancer or COVID-19) and the
specific treatment received.

4.2. Data preprocessing

    To ensure that the actual text information retrieval works successfully,
the collected text needs to be preprocessed and cleansed of any noise in the
data. For the proposed research, we have conducted three preprocessing steps:

       1. Cleansing,
       2. Stemming, and
       3. Tokenization

    The first step, cleansing, consists of processes to remove unwanted
characters, e.g., HTML tags, emojis, and ASCII artwork. This is a non-trivial
task when dealing with forum posts, as people express themselves quite
informally. In the second step, stemming, inflected and derived words are
reduced to their word stem [17]. Different algorithms for stemming exist in the
literature, e.g., the Lovins Stemmer [18], the Paice Stemmer [19], and the
predominant Porter Stemmer [20]. All these stemming algorithms are best suited
for English; in the present study, the Porter Stemmer is used. The Porter
stemming algorithm is based on five steps, and in each step, a specified set of
rules is applied to the word being processed. For instance, the first step
contains the following processing rules, as represented in figure 4. In the
third step, tokenization, character and word sequences are sliced into tokens.
Typically, the tokens are words or terms, but in this study, tokens are only
words. After the tokenization, stop words are removed.

4.3. Information retrieval

    To make sure that the clustering of posts into disease-type clusters is
accurate, information must be extracted from the content attributes of all
collected posts. This is achieved by using various natural language processing
[9] and text retrieval methods, together with a predefined feature vector
containing the names of a range of disease types. For the current work, we use
the term weighting approach. This approach combines term frequency and inverse
document frequency to yield term frequency-inverse document frequency, which is
the term's final weight. The purpose of term frequency (tf) is to measure how
often a term occurs in a specific document; in this study, tf is simply an
unadjusted count of term appearances.
    Term frequency [21] can be defined as
tf(t, d) = number of occurrences of term t in document d.
Documents vary in length, which entails a bias in tf; that is, a term is likely
to appear more often in a lengthy document than in a short document, given that
the documents are similar in content [22]. Whenever a term is frequent in a
document, it is likely to be relevant to that specific document. The purpose of
inverse document frequency (idf) is to measure the weight of a term in a
collection of documents; a rare term is often more valuable than a common term
in a collection of documents [23].
    Term frequency-inverse document frequency (tf-idf) is a measure of how
important a word is to a specific document in a collection of documents. A
significant tf-idf weight is obtained whenever: 1. the term
frequency is high for the specific document, and 2. the document frequency is
low for the term across the collection of documents. Combining the tf and idf
weights tends to filter out standard terms that do not carry much information
[24], [25].

5. Clustering
5.1. Existing DBSCAN clustering

    Clustering is a process of grouping unlabeled data into clusters of
homogeneous attributes. The data points in each cluster have similar traits,
such that the variance within a cluster is minimal and the variance across
clusters is maximal. In the proposed study, a cluster represents a homogeneous
group of similar texts from posts. Density-Based Spatial Clustering of
Applications with Noise (DBSCAN) is a clustering algorithm based on the density
of data points (also known as observations). DBSCAN creates clusters with a
high density of data points, and in doing so, it allows clusters of any shape
even if the data contains noise, which is a slightly different approach from
conventional clustering algorithms.
    DBSCAN can find clusters of different sizes and does not require the number
of clusters as input. In DBSCAN, the Ɛ-neighborhood of a point p is defined by
the points within a radius Ɛ of p. If a point p's Ɛ-neighborhood contains at
least mpts points, p is called a core point. A point p is directly
density-reachable from a point q if p is within the Ɛ-neighborhood of q and q
is a core point. A point p is density-reachable from a point q with regard to Ɛ
and mpts if there is a chain of points p1, ..., pn, where p1 = q and pn = p,
such that p(i+1) is directly density-reachable from p(i). A point p is
density-connected to another point q with regard to Ɛ and mpts if there is a
point o such that both p and q are density-reachable from o. A point p is a
border point if p's Ɛ-neighborhood contains fewer than mpts points and p is
directly density-reachable from a core point. A data point is called noise if
it is neither a core point nor a border point. A cluster C is a non-empty set
that satisfies the following two conditions for all point pairs (p, q):
    1. If p is in C and q is density-reachable from p, then q is also in C; and
    2. If p and q are in C, then p is density-connected to q.
    To create a cluster, the DBSCAN algorithm starts from an arbitrary point p
and searches for all the points that are density-reachable from p with respect
to Ɛ and mpts. If p is a core point, then a new cluster with p as a core point
is created. If p is a border point, DBSCAN moves on to the next point in the
sample. DBSCAN can also merge any two clusters into one if the clusters are
density-reachable from one another. The algorithm converges when no new points
can be added to any existing or new cluster.

5.2. MapReduce DBSCAN clustering

    The entire process of DBSCAN clustering is computationally costly, with
high time and memory consumption. To reduce this consumption and increase
efficiency, MapReduce DBSCAN was proposed. The only difference between regular
DBSCAN clustering and DBSCAN via MapReduce is that the computation is
distributed. The steps followed in MapReduce DBSCAN are shown in figure 4
below.

[Figure 4 diagram with the boxes: Database, Partition, DBSCAN, Map the profile,
Merging, Data mapped to cluster.]
Figure 4: MapReduce DBSCAN

5.3. Partition in MapReduce DBSCAN clustering

    Runtime efficiency can be maximized through parallel processing, provided
the data is well balanced. If the data is well balanced, the computational load
can be distributed evenly across the compute nodes. Real-life text data is
usually unbalanced, and the best strategy to deal with this is data
partitioning. This is an inherent part of MapReduce DBSCAN.
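
    As a minimal, single-machine sketch of how the term weighting of section
4.3 can feed the point definitions of section 5.1, the Python below builds
tf-idf vectors for tokenized posts and labels each post as a core, border, or
noise point. It is an illustration under stated assumptions, not the
prototype's implementation: the function names are hypothetical, idf is taken
as log(N / df) (one common variant; the text does not fix a formula), cosine
distance is assumed as the distance measure, and the Ɛ-neighborhood is assumed
to include the point itself.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per tokenized document.

    tf is the raw count of the term in the document; idf = log(N / df) is an
    assumed variant, since the text does not specify the exact formula.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors (1.0 if either is all zeros)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def classify_points(vectors, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'."""
    n = len(vectors)
    # Assumption: a point always belongs to its own eps-neighborhood.
    neighbors = [
        [j for j in range(n)
         if i == j or cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors[i]):
            labels.append("border")  # within eps of at least one core point
        else:
            labels.append("noise")
    return labels


# Toy posts, already cleansed, stemmed, and tokenized.
posts = [
    ["chemo", "chemo", "nausea"],
    ["chemo", "nausea"],
    ["chemo", "nausea", "nausea"],
    ["fever", "cough"],
]
labels = classify_points(tfidf_vectors(posts), eps=0.1, min_pts=3)
print(labels)  # ['border', 'core', 'border', 'noise']
```

    The pairwise neighborhood search is quadratic in the number of posts, which
is the cost that the partitioning described in this section is meant to spread
across nodes.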
    Recursive split is a frequently used data partitioning method that splits the entire data
set into smaller subsets. The split is applied recursively until a stop criterion is met: either
all partitions contain fewer than a given number of points, or a given number of partitions has
been made. A partition cannot be narrower than 2Ɛ, so after a split each sub-partition must
still extend at least 2Ɛ. When splitting a partition into two in MapReduce DBSCAN, all possible
splits are considered, and the split that minimizes the loss in one of the sub-partitions is
chosen. Here, the loss is calculated as the difference between the number of points in
sub-partition-1 and half of the number of points in sub-partition-2. Each partition is given a
key and associated with a reducer.

5.4. Local DBSCAN

    Each reducer is given a partition and all its associated data points, and hence a mapper
should prepare all data related to a partition. For example, the data assigned to a partition Pi
consists of the data Ci within Pi and the data within Pi's Ɛ-width extended partition Ri that
overlaps the bordering partitions.
    Local DBSCAN borrows its working principles from the original DBSCAN to perform the
clustering. It starts with an arbitrary data point p belonging to Ci and searches for points that
are density-reachable from p with respect to Ɛ and mpts. If p is a core point, its Ɛ-neighborhood
is explored for further data points. If Local DBSCAN finds a point in the outer margin that is
directly density-reachable from a point in the inner margin, it is added to the merge-candidate
set. If a core point is in the inner margin, it is also added to the merge-candidate set. Each
point in a cluster is given a local cluster ID generated from the partition ID and the label ID
of the local clustering.

5.5. Mapping profile

    After each partition has undergone clustering and merge candidate lists have been
generated, the merge candidate lists are collected into a single merge candidate list. The basics
of merging the clusters from the different partitions are:
    1. Execute a nested loop on all points in the collected merge candidate lists to see if the
       same data points exist with different local cluster IDs;
    2. If found, then merge the clusters.
    Figure 5 illustrates two examples of cluster-merge propositions. Example 1: the points d1,
belonging to C1, and d2, belonging to C2, are core points, and d2 is directly density-reachable
from d1; thus, C1 should merge with C2. Example 2: the point d3, belonging to C1, is a core
point, and r, belonging to C2, is a border point; thus, C1 should not merge with C2.
    The purpose of the mapping profile step is to create a profile that maps clusters that
should be merged. The algorithm for generating the mapping profile is given in figure 5. The
output of the algorithm is a list of pairs of local clusters to be merged (denoted MP) and a list
of border points (denoted BP); a point p is at least a border point in a merged cluster (this is
taken care of in the next step).

for each cp in CP do
    for each bp in BP do
        if cp.id == bp.id then
            MP.add((cp.local cluster id), (bp.local cluster id))
            BP.delete(bp)
        end if
    end for
end for
Figure 5: Merge mapping

5.6. Merge

    The previous step resulted in a list of pairs of clusters to be merged. The IDs of the local
clusters should be changed into a unique global ID after merging. Thus, a global perspective of
all local clusters is built (algorithm in figure 6). Lastly, as mentioned in the previous step,
noise points are set to border points.
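    Sections 5.5 and 5.6 can be sketched together as follows. This is a minimal illustration
under stated assumptions, not the implementation used in the study: the `Candidate` record, the
function names, and the "P1:0"-style local cluster IDs are hypothetical. The nested loop of the
mapping step is replaced here by a dictionary lookup on the point ID, and a union-find structure
stands in for the Map Slot bookkeeping of the global ID map (figure 6).

```python
from collections import namedtuple

# Hypothetical merge-candidate record: a global point id plus the local
# cluster id the point received inside one partition.
Candidate = namedtuple("Candidate", ["id", "local_cluster_id"])


def mapping_profile(core_points, border_points):
    """Return (MP, BP): cluster pairs to merge and the unmatched border points."""
    core_by_id = {}
    for cp in core_points:
        core_by_id.setdefault(cp.id, []).append(cp)

    pairs, remaining = set(), []
    for bp in border_points:
        matches = core_by_id.get(bp.id, [])
        for cp in matches:
            # Same point seen as core in one partition and border in another.
            pairs.add((cp.local_cluster_id, bp.local_cluster_id))
        if not matches:
            remaining.append(bp)
    return pairs, remaining


def global_ids(local_cluster_ids, pairs):
    """Map every local cluster id to a global id shared by merged clusters."""
    ids = set(local_cluster_ids) | {c for pair in pairs for c in pair}
    parent = {c: c for c in ids}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as representative, echoing "lowest index".
            parent[max(root_a, root_b)] = min(root_a, root_b)

    roots = sorted({find(c) for c in ids})
    root_to_global = {root: g for g, root in enumerate(roots)}
    return {c: root_to_global[find(c)] for c in ids}


# Made-up example: point 7 is a core point in partition P1 and a border point in P2.
core = [Candidate(7, "P1:0"), Candidate(9, "P1:1")]
border = [Candidate(7, "P2:0"), Candidate(12, "P2:1")]
pairs, remaining = mapping_profile(core, border)
ids = global_ids(["P1:0", "P1:1", "P2:0", "P2:1"], pairs)
print(pairs, remaining, ids)
```

    In this example the local clusters "P1:0" and "P2:0" receive the same global ID, while the
other two local clusters keep IDs of their own.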
for each element pair ei, ej ∈ MP; i ≠ j; do
    if ei ∉ L ∧ ej ∉ L then
        put ei and ej into the same Map Slot in L
    end if
    if ei ∈ L ∧ ej ∉ L then
        put ej into ei's Map Slot in L
    end if
    if ei ∈ L ∧ ej ∈ L then
        if ei and ej are not in the same Map Slot in L, then move the Map Slot
        with the highest index to the Map Slot with the lowest index
    end if
end for
return L
Figure 6: Global ID map

6. Classification

    The result of the clustering is a set of specific disease type clusters. To enable further
filtering possibilities for the end-user, a within-cluster classification is conducted such that
each post within a disease type cluster is labeled with one of the six labels illustrated in
table 1. This allows an end-user to filter the forum posts such that, for instance, only posts
about treatments (class) of a specific disease (cluster) are shown.
    We have chosen to classify with a Naive Bayes classifier trained with a manually created
training set augmented with the freely available set from the BioText Project, UC, Berkeley
[26]. The Frunza et al. study also uses a Naive Bayes classifier with promising results [14].
However, they classified abstracts from scientific articles, which is a somewhat different
data-domain than the present study's non-clinical texts. The time complexity for training a
Naive Bayes classifier is O(np), where n is the number of training observations and p is the
number of features; thus, for a fixed number of features, the complexity is O(n) in the number of
observations. When testing, Naive Bayes is also linear, which is optimal for a classifier.

Table 1
Results

  Class label   Class description (example post in quotes)
  Cure          About cancer-curing treatments. "After 16 chemo sessions, my cancer was gone."
  No cure       About cancer non-curing treatments. "My husband went through chemo since he had
                bladder cancer. Sadly, he passed."

6.1. Clustering

    MapReduce DBSCAN is a distributed extension of DBSCAN, and they use the same principle for
clustering. Thus, given the same input, the two clustering methods should yield the same output.
The results in this section show that this is indeed the case, and we thereby consider the
implementations of MR-DBSCAN and DBSCAN to be verified in terms of the correctness of the logical
output. The actual implementations do not share code, so it seems fair to disregard the odd risk
of having both implementations wrong in a manner that leads to the same output.
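    The requirement that both methods "yield the same output" has to allow for different
cluster numbering, since the two implementations are free to name their clusters differently. A
minimal sketch of such a check, assuming each clustering is given as a list of labels in the same
point order; the function name `same_clustering` and the use of -1 as a noise label are
hypothetical conventions, not taken from the study:

```python
def same_clustering(labels_a, labels_b):
    """True if both label lists group the points identically, up to renaming."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label on one side must always pair with one label on the other.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


# Same grouping, different cluster names (made-up labels; -1 marks noise here).
print(same_clustering([0, 0, 1, -1], [5, 5, 2, -1]))  # True
# The second point changes cluster, so the groupings differ.
print(same_clustering([0, 0, 1], [0, 1, 1]))          # False
```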
    For comparing the clustering results of DBSCAN and MR-DBSCAN, the Adjusted Rand Index
(ARI) [30] is used. The index is a similarity measure between two clusterings, and it is obtained
by counting the pairs of points on which the two clusterings agree (both place the pair in the
same cluster, or both place it in different clusters) against the pairs on which they disagree,
corrected for chance. If the label assignments coincide fully, the index is 1, and if they agree
no better than chance, the index is close to 0. If DBSCAN and MR-DBSCAN are implemented
correctly, the ARI must be 1 regardless of: 1. the number of points in the data set, 2. the
number of partitions in MR-DBSCAN, and 3. the parameter settings for Ɛ and mpts. Also, the number
of partitions (#P) in MapReduce DBSCAN, the coverage percentage (%C), and the number of labels
(#L) in DBSCAN and MapReduce DBSCAN have been recorded. The results show (table 2) that the ARI
is 1 in all 18 test cases; a necessary condition for this is that both MR-DBSCAN and DBSCAN yield
the same number of labels in all tests, which is also the case (table 2).
    Also, MR-DBSCAN has been partitioning its data into 3-8 partitions (table 2), which means
that even though the data has been split and clustered individually per partition, the merging
works as intended and yields the same clustering as DBSCAN. The coverage percentage value is also
identical for the two clusterings in all test cases.

Table 2
Adjusted Rand Index of clustering

                            DBSCAN         MapReduce DBSCAN
  Posts    Ɛ       mpts    #L    %C       #L    %C      #P     ARI
  25000    10^-3   5       10    3.34     10    3.34    8      1
  25000    10^-3   50      2     2.99     2     2.99    8      1
  25000    10^-3   100     1     2.66     1     2.66    8      1
  25000    10^-2   5       10    3.34     10    3.34    7      1
  25000    10^-2   50      2     2.99     2     2.99    7      1
  25000    10^-2   100     1     2.66     1     2.66    7      1
  25000    10^-1   5       11    3.37     11    3.37    3      1
  25000    10^-1   50      2     2.99     2     2.99    3      1
  25000    10^-1   100     1     2.66     1     2.66    3      1
  35000    10^-3   5       23    2.92     23    2.92    7      1
  35000    10^-3   50      2     2.37     2     2.37    7      1
  35000    10^-3   100     1     2.02     1     2.02    7      1
  35000    10^-2   5       23    2.92     23    2.92    6      1
  35000    10^-2   50      2     2.37     2     2.37    6      1

6.2. Real-time analysis of MapReduce DBSCAN

    The purpose of this experiment is to examine the runtime of each of the MapReduce DBSCAN
steps under variations in
       1. the number of forum posts, and
       2. the neighborhood radius Ɛ.
    These two parameters have the most significant influence on MapReduce DBSCAN's runtime.
The Ɛ parameter is used when partitioning the data set, and therefore, it directly influences the
beneficial effects of MapReduce. In all tests, the lower point-count threshold for establishing a
core point, mpts, is fixed to 5 points. This is done because the parameter has very little
runtime influence, and this influence is isolated to the DBSCAN step, i.e., it does not highlight
runtime differences between DBSCAN and MapReduce DBSCAN.
    For all 30 test cases (table 3), mapping takes almost no time, and merging also has only a
small effect on runtime. For values of Ɛ that are large relative to the data span, i.e., 1 and
0.1, MapReduce DBSCAN cannot partition the data set well. This affects the runtime, as the
clustering is then performed on a single partition (or very few), and no MapReduce improvements
are achieved. For relatively small values of Ɛ, i.e., 0.001 and 0.0005, the data set is split
well into partitions, but due to the low value of Ɛ there are many possible partitions, and much
time is spent in search of the best partitioning. Thus, as the results show, the partitioning
becomes slower when Ɛ decreases, but the local DBSCAN becomes faster. Hence, Ɛ needs to be set
with care to strike a balance and minimize the total runtime of MapReduce DBSCAN. In our
experiments, the balance is found at Ɛ = 0.01; here, the partitioning runtime is relatively low,
and likewise for the local DBSCAN, which results in a relatively low total runtime.

6.3. Validation of clustering

    The purpose of this experiment is to compare runtime as a function of the number of forum
posts for three clustering algorithms: DBSCAN, MapReduce DBSCAN,
and Hierarchical Density Estimates DBSCAN. Algorithm parameters are fixed and equal across the
tests in order not to bias the results. Specifically, the lower point-count threshold for
establishing a core point is mpts = 50 and the neighborhood radius is Ɛ = 0.01 for all tests.
Note that the setting Ɛ = 0.01 was previously found (section 6.2) to be a suitable choice for
MapReduce DBSCAN. The data sets in this experiment are various subsets of the collected forum
posts; the number of tf-idf features has been limited to 1000. The results of all tests are
reported in table 3 and figure 7.

Table 3
Runtime results of the clustering algorithms

  Posts    MapReduce DBSCAN [s]    DBSCAN [s]    Hierarchical Density Estimates DBSCAN [s]
  10000    11.696                  4.755         11.969
  20000    19.545                  21.167        48.731
  30000    31.115                  50.237        105.007
  40000    37.321                  92.392        217.033

[Figure 7 plot: runtime in seconds (s) on the vertical axis against the number of posts on the
horizontal axis]
Figure 7: Comparison of clustering DBSCAN (blue), MapReduce DBSCAN (red), and Hierarchical
Density Estimates DBSCAN (green)

7. Discussion

    The primary motivation of the proposed work is that the clinical or medical information
contained in non-clinical posts collected from forums is valuable and worth making available to
others in a more structured form. In the proposed study, this is achieved by a decision support
system that can act as a source of information to help patients of diseases such as COVID-19 and
cancer, and their caretakers and families, to learn about disease trajectories, initial symptoms,
diagnosis outcomes, sources, treatment centers, treatments taken, after-effects of treatment, and
costs.
    The information retrieval framework processes the non-clinical forum posts using text
retrieval, unsupervised clustering, and a classification model. The framework is designed to
execute on a distributed computing set-up like MapReduce to increase computational efficiency.
This considerably improves the response time of a computationally costly clustering on texts,
which is needed for a real-time application.
    Moreover, the endpoint of the current framework is a user interface that enables the
end-user to interact with the database and mine for valuable information to understand the
overall trajectory of any disease. This helps patients prepare before getting a doctor's
consultation and advice. The framework can also mobilize online social communities of patients
and their caretakers and families using soft information from non-clinical, hitherto unused
conversations.
    The proposed framework contributes to the existing literature in several ways. Adding,
refining, and benchmarking more clustering and classification methods would yield more
comprehensive information from non-clinical texts and might lead to better results, i.e., more
accurate clustering and classifications, and thus, ultimately, a better end-user service. For the
classification, it would mainly be of interest to collect and use a more extensive training set.
The response time of DBSCAN and Hierarchical Density Estimates DBSCAN clustering has been
improved by redesigning the algorithms to guarantee upper bounds on memory consumption. This can
act as a reference in the literature for future researchers.
    Lastly, in conclusion, the proposed system and framework is easily generalizable such that
it can readily be applied in other domains besides COVID-19 or cancer by quickly
loading new data-sets and associated feature-vectors.

8. References

[1]  K. Jensen et al., "Analysis of free text in electronic health records for identification of
     cancer patient trajectories," Scientific Reports, vol. 7, no. 1, art. no. 1, Apr. 2017,
     doi: 10.1038/srep46226.
[2]  S. A. Murray, M. Kendall, K. Boyd, and A. Sheikh, "Illness trajectories and palliative
     care," BMJ, vol. 330, no. 7498, pp. 1007–1011, Apr. 2005, doi: 10.1136/bmj.330.7498.1007.
[3]  "The Danish Cancer Society," International.
     https://www.cancer.dk/international/about-the-danish-cancer-society/ (accessed 09th August,
     2020).
[4]  "WHO Coronavirus Disease (COVID-19) Dashboard." https://covid19.who.int (accessed 09th
     August, 2020).
[5]  G. Umefjord, K. Hamberg, H. Malker, and G. Petersson, "The use of an Internet-based Ask the
     Doctor Service involving family physicians: evaluation by a web survey," Fam Pract, vol. 23,
     no. 2, pp. 159–166, Apr. 2006, doi: 10.1093/fampra/cmi117.
[6]  G. Umefjord, H. Sandström, H. Malker, and G. Petersson, "Medical text-based consultations on
     the Internet: A 4-year study," International Journal of Medical Informatics, vol. 77, no. 2,
     pp. 114–121, Feb. 2008, doi: 10.1016/j.ijmedinf.2007.01.009.
[7]  A. K. Kushwaha and A. K. Kar, "Language Model-Driven Chatbot for Business to Address
     Marketing and Selection of Products," in Re-imagining Diffusion and Adoption of Information
     Technology and Systems: A Continuing Conversation, Cham, 2020, pp. 16–28,
     doi: 10.1007/978-3-030-64849-7_3.
[8]  A. K. Kushwaha and A. K. Kar, "Micro-foundations of Artificial Intelligence Adoption in
     Business: Making the Shift," in Re-imagining Diffusion and Adoption of Information
     Technology and Systems: A Continuing Conversation, Cham, 2020, pp. 249–260,
     doi: 10.1007/978-3-030-64849-7_22.
[9]  A. K. Kushwaha, A. K. Kar, and P. Vigneswara Ilavarasan, "Predicting Information Diffusion
     on Twitter a Deep Learning Neural Network Model Using Custom Weighted Word Features," in
     Responsible Design, Implementation and Use of Information and Communication Technology,
     Cham, 2020, pp. 456–468, doi: 10.1007/978-3-030-44999-5_38.
[10] A. K. Kushwaha, S. Mandal, R. Pharswan, A. K. Kar, and P. V. Ilavarasan, "Studying Online
     Political Behaviours as Rituals: A Study of Social Media Behaviour Regarding the CAA," in
     Re-imagining Diffusion and Adoption of Information Technology and Systems: A Continuing
     Conversation, Cham, 2020, pp. 315–326, doi: 10.1007/978-3-030-64861-9_28.
[11] S. Ebadollahi, J. Sun, D. Gotz, J. Hu, D. Sow, and C. Neti, "Predicting Patient's Trajectory
     of Physiological Data using Temporal Trends in Similar Patients: A System for Near-Term
     Prognostics," Amia Annual Symposium, vol. 2010, pp. 192–196, 2010.
[12] "Temporal disease trajectories condensed from population-wide registry data covering 6.2
     million patients | Nature Communications." https://www.nature.com/articles/ncomms5022
     (accessed 09th August, 2020).
[13] X. Ji, S. A. Chun, and J. Geller, "Predicting Comorbid Conditions and Trajectories Using
     Social Health Records," IEEE Transactions on NanoBioscience, vol. 15, no. 4, pp. 371–379,
     Jun. 2016, doi: 10.1109/TNB.2016.2564299.
[14] O. Frunza, D. Inkpen, and T. Tran, "A Machine Learning Approach for Identifying
     Disease-Treatment Relations in Short Texts," IEEE Transactions on Knowledge and Data
     Engineering, vol. 23, no. 6, pp. 801–814, Jun. 2011, doi: 10.1109/TKDE.2010.152.
[15] C. Lousteau-Cazalet et al., "A decision support system for eco-efficient biorefinery process
     comparison using a semantic approach," Computers and Electronics in Agriculture, vol. 127,
     pp. 351–367, Sep. 2016, doi: 10.1016/j.compag.2016.06.020.
[16] C. C. Yang and T. D. Ng, "Analyzing and Visualizing Web Opinion Development and Social
     Interactions With Density-Based Clustering," IEEE Transactions on Systems, Man, and
     Cybernetics - Part A: Systems and Humans, vol. 41, no. 6, pp. 1144–1155, Nov. 2011,
     doi: 10.1109/TSMCA.2011.2113334.
[17] C. Manning, P. Raghavan, and H. Schuetze, "Introduction to Information Retrieval," p. 581,
     2009.
[18] An algorithm for suffix stripping. 1980.
[19] J. A. Goldsmith, D. Higgins, and S. Soglasnova, "Automatic Language-Specific Stemming in
     Information Retrieval," in Cross-Language Information Retrieval and Evaluation, Berlin,
     Heidelberg, 2001, pp. 273–283, doi: 10.1007/3-540-44645-1_27.
[20] C. H. Porter, L. E. Lynch, J. A. Herrig, and R. J. Ziebol, "(54) DEVICE AND METHOD FOR
     VASCULAR ACCESS," p. 60.
[21] S. E. Robertson and K. Spärck Jones, "Simple, proven approaches to text retrieval,"
     University of Cambridge, Computer Laboratory, UCAM-CL-TR-356, 1994. Accessed: 10th August,
     2020. [Online]. Available: https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-356.html.
[22] S. E. Robertson and K. S. Jones, "Relevance weighting of search terms," Journal of the
     American Society for Information Science, vol. 27, no. 3, pp. 129–146, 1976,
     doi: 10.1002/asi.4630270302.
[23] S. Robertson, "Understanding inverse document frequency: on theoretical arguments for IDF,"
     Journal of Documentation, vol. 60, no. 5, pp. 503–520, Jan. 2004,
     doi: 10.1108/00220410410560582.
[24] Kar, Arpan, "Applications of Machine Learning in Business," Business Frontiers, 24th July,
     2020.
[25] A. Kar, "Understanding Machine Learning and Artificial Intelligence and their effects on
     Financial Systems – Business Fundas."
[26] "Classification of Diseases and their Treatments Using Machine Learning Approach -
     ProQuest."
     https://search.proquest.com/openview/423cca63369eb17808ce3e845e51b852/1?cbl=2029261&pq-origsite=gscholar
     (accessed 12th August, 2020).