=Paper=
{{Paper
|id=Vol-2786/Paper59
|storemode=property
|title=Information Labelling of Medical Forum Posts by Non-Clinical Text Information Retrieval
|pdfUrl=https://ceur-ws.org/Vol-2786/Paper59.pdf
|volume=Vol-2786
|authors=Amit Kumar Kushwaha,Arpan Kumar Kar
|dblpUrl=https://dblp.org/rec/conf/isic2/KushwahaK21
}}
==Information Labelling of Medical Forum Posts by Non-Clinical Text Information Retrieval==
Amit Kumar Kushwaha, Arpan Kumar Kar
Indian Institute of Technology Delhi, New Delhi, India
Abstract
With the advent of Web 2.0, modern societies produce vast amounts of data. Merely keeping up with storage and transmission is difficult, and analyzing the data to extract useful information is harder still. Research on healthcare data processing has so far concentrated on formal clinical data, yet a great deal of valuable information lies idle in non-clinical sources. The proposed study combines state-of-the-art methods from distributed computing, text retrieval, clustering, and classification into a computationally efficient system that can clarify cancer patient trajectories based on non-clinical, freely available online forum posts. The motivation is that informed patients, caretakers, and relatives often achieve better overall treatment outcomes because they are better placed to manage the disease properly. The resulting software prototype is fully functional and built to serve as a test bench for various text information retrieval and visualization methods. With the prototype, we demonstrate computationally efficient clustering of posts into cancer types and subsequent within-cluster classification into trajectory-related classes. The system also provides an interactive graphical user interface that allows end-users to mine and oversee the valuable information.
Keywords
Machine Learning, Chatbot, Artificial Intelligence, Medical, Ontology
1. Introduction

Most patients who acquire a progressive and terminal disease are towards the latter end of life. These illnesses are primarily respiratory disorders, cancers, and cardiovascular diseases. Such an illness implies a long time frame for the patients themselves and for the surrounding relatives and caretakers [1], [2]. The trajectory over this time frame can be summarized as a sequence of steps, shown in figure 1. Although the trajectory looks simple and compact, it is complex and contains a range of concerns underneath each of the four steps, for instance: life expectancy at each stage, patterns of decline, probable interactions with other health services, medicinal side effects, treatment plans and costs at each stage, any other non-documented side effects, palliative care, and many more.

Misinformed outputs can lead to costlier yet unsuccessful and delayed treatment. Scholarly outputs, in turn, can clarify the trajectories over the time frame and can lead to better overall treatment owing to better clinical sources and decisions. This in turn raises the possibility of fewer re-admissions, decreased health care costs, and a higher quality of life for patients in the potentially final weeks, months, and years. Ultimately, better overall care is obtainable via clarification during the early stages, estimation, and communication of patient-specific symptoms and disease trajectories.
ISIC’21: International Semantic Intelligence Conference,
February 25-27, 2021, Delhi, India
EMAIL:Kushwaha.amitkumar@gmail.com;
Arpan_kar@yahoo.co.in
ORCID: 0000-0002-5537-1250; 0000-0003-4186-4887
©️ 2020 Copyright for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
The proposed study is motivated by the idea of exploiting the relevant yet idle information in the ever-increasing user-generated content available as online, freely accessible non-clinical text, for the benefit of anyone interested in a clinical trajectory, e.g., cancer patients and COVID-19 patients. With the recent rise in COVID-19 cases, there was massive unrest during the initial stages of the disease spread, when even patients who had tested positive were not sure about the trajectories of the sequence of steps in figure 1. This motivated us to pursue this study, which can be an essential contribution to the literature for researchers and have day-to-day practical implications for someone who has internet access but cannot navigate through large amounts of unstructured text to find simple, relevant clinical information.

Historical data shows that approximately one-third of the world's population is diagnosed with cancer during their lifetime [3]. According to the World Health Organization (WHO) [4], as of 9 August 2020, 20 million patients had tested positive for COVID-19 globally. Thus, a large community of potential end-users can consume the non-clinical data to answer their queries related to clinical trajectories. A diagnosis of cancer or, in recent times, COVID-19 leads to several reactions; a predominant one is to first seek information online on the specific symptoms, type, and severity, and finally the trajectory prognosis.

A trend that has recently gained prominence among the community is to communicate on online forums [5], [6], [7], [8], [9], [10]. On these medical forums, people can write freely about their emotions and what they feel about the disease, treatment, after-effects, and normalcy after the treatment without disclosing their identity. For instance, on cancer forums, people write freely about their initial-stage frustrations and fears and how they overcame them. The same applies to COVID-19 forums. No healthcare system leverages this freely available, non-clinical, yet potentially very relevant information.

Mining all such relevant information from the wide variety, volume, and veracity of online user-generated content on forums lies at an overlap of technical and scientific research domains. It is more challenging than mining standard health texts such as electronic health records (EHR), including hospital admission journals that capture doctors' comments, medical reports, and discharge summaries. In all these formal EHRs, the language of causes, symptoms, cures, and after-effects is more concise and specific, and medical terms are used more distinctly from case to case. These terms are very different from a layperson's terms for the same context on online forums. This adds to our motivation to make this non-clinical data available to the general public.

The current research objective is to clarify and communicate patient trajectories at each stage through computationally efficient text information retrieval from non-clinical online forum posts. The study meets this objective by building a fully functional and generalizable framework that can screen/filter, process, and present the non-clinical data for clinical trajectories in a visual and informative way. The framework is designed to act as a test bench for future text information retrieval methods and is not restricted to the current study. The study's underlying premise is that valuable, inherently unstructured information is freely available on non-clinical medical forums.

2. Related work

Research whose objectives, methods, and hypotheses are rooted in data mining has mostly focused on text summarization. In 2005, Murray et al. [2] performed a clinical review that summarizes three disease trajectories: organ failure (heart and lung), frail elderly, and cancer. In a related study in 2010, Ebadollahi et al. [11] predicted a patient's trajectory from temporal physiological data. This work was extended in 2014 by Jensen et al. [12] with disease trajectory data spanning fifteen years from a large patient population.

In 2016, Ji et al. [13] proposed a predictive model for health condition trajectories and co-morbidity relationships by training the model on social health records. Another related study was performed in 2017 by Jensen et al. [1], who applied text analysis to EHRs to predict (cancer) patient trajectories automatically. However, summarizing all the above work, we interpret a
gap in text information retrieval using distributed clustering and classification. None of the highlighted studies has ventured to use such a framework, which can be computationally efficient. At the end of the proposed framework, a classification model can quickly identify patient trajectories using non-clinical texts from online forums.

In a related 2011 study, Frunza et al. [14] automatically extracted sentences about diseases and treatments from clinical papers. Based on the extracted sentences, semantic relations between diseases and associated treatments were then identified. Another related study was done by Rosario et al. [15] in 2004. The focus of their work was to recognize text entities containing information about diseases and treatments. They used Hidden Markov Models and Maximum Entropy Models to perform the entity and disease-treatment relationship recognition.

Compared to Frunza et al., the latter work focuses mostly on classification. The present study also focuses on text retrieval and clustering. It also covers cancer and COVID-19 trajectories, whereas the other studies have only focused on cancer as a prevalent disease.

Lastly, in the 2011 study by Yang et al. [16], density-based clustering was used to identify topics within online forum threads on social media. They also developed a visualization tool to provide an overview of the identified topics. Their tool's purpose was to extract topics with sensitive information related to terrorism or other criminal activities; however, it might also be tailored to extract other topics. Besides using DBSCAN, the study proposed a related clustering method, namely SDC (Scalable Density-based Clustering). The structure of the Yang et al. study is, to some extent, similar to that of the present study: here, too, topics are extracted from online forum posts, density-based clustering is used, and result visualization capabilities are provided.

2.1. Research gap addressed by the novelties of the current work

The novelty of the proposed study lies in combining state-of-the-art unsupervised distributed-computing text retrieval through clustering. This contribution is topped by a coherent classification that is computationally efficient and can identify patient trajectories based on non-clinical texts. Hence, the outcome of the proposed study provides a unique and novel means for individuals and researchers looking for cancer and COVID-19 trajectories. This is done by activating relevant information hidden in non-clinical texts that has potentially been overlooked hitherto by the established health care systems.

2.2. Significance

In general, computational retrieval of information from the vast amounts of health care texts is significant. Specifically, for this study, the significance lies in the systematic combination of state-of-the-art methods to mine, refine, categorize, and present laypersons' descriptions related to cancer trajectories. It is significant for empowering patients and caretakers and for helping to build healthy patient/caretaker communities by leveraging the soft information not hitherto used by the established health care systems, e.g., information about emotions, feelings, or personal preferences.

3. Proposed framework
3.1. Overview

The proposed framework has four major building blocks or components, including a database component for storing the cluster outputs. A detailed visual representation of the framework is given in figure 1 below. It has been designed as a micro-service architecture with one process per component to keep the framework light from a production implementation standpoint.

[Figure 1 diagram. Front-end views: Search, Posts, Statistics, Clusters, Tools. Connected components: API, Database, and a service component performing information retrieval from text, clustering, and classification.]
Figure 1: Framework
The left part of the framework in figure 1 above is the front-end component that handles the user interaction; it is elaborated in section 3.2. The API component's sole purpose is to enable the front-end component to interact with the database and with other service components. The Database component persists all gathered forum posts and the computed results, e.g., clusters, classes, and cancer trajectories. The Service component handles the computationally burdensome data processing; the micro-service architecture makes it possible to scale this component alone. Implemented as a scalable unit, the service component is well suited to a distributed computing approach. The clustering calculations in particular are burdensome and need to be made efficient. Currently, the text retrieval and classification calculations do not need to be scaled, as they are much faster than the clustering.

3.2. Front-end

Having a front end to interact with the data helps to explore results from the end-user's perspective. The developed user interface is useful for exploring the collected data set of forum posts and for showing information from an area of interest. For instance, a user can select a cluster of interest, i.e., a disease type such as COVID-19 or lymphoma, and receive only posts within that cluster. A user can also choose a pivot of information, e.g., side effects of COVID-19 medicine or side effects of cancer radiation, and thereby see all posts from the cancer or COVID-19 cluster that contain information about side effects. Such a tool is relevant for scientific use as well as for patients and caretakers.

[Figure 2 screenshot: a "Need Help !!" search page with the prompt "Please type your query in the text box below", a Search button, and result rows for Endometrial, Bone, and Prostate, each listing surgery, chemotherapy, radiation, and plasma.]
Figure 2: Front end search view

The user interface consists of five main views: Search, Posts, Statistics, Clusters, and Tools (figure 2). In the Search view, a user can search the entire collection of forum posts and the identified clusters. Initially, a view of the types of clusters, as shown in figure 3, is displayed to the end-user. When a cluster type is clicked, all posts associated with that type of cluster are displayed in the Posts view. Users can browse through the posts within a type of cluster and narrow them down by selecting a class label.

[Figure 3 screenshot: Statistics and Clusters view with the clusters Kidney (160 posts), Heart (70 posts), and Liver (150 posts), each broken down by the class labels Cure, Side effect, Disease, No cure, and Treatment.]
Figure 3: Cluster view

3.3. Functional validation of the outputs

The robustness of a framework is usually judged on statistical metrics, but it also needs to be measured on the intuitiveness of the text outputs as received by the users. Hence, in order to measure the outputs of the framework concretely, we examine them through the following qualitative lenses in addition to the statistical metrics:
Functional intuitiveness:
• Appropriate,
• Suitable
Performance:
• Time,
• Utilization
Compatibility:
• Coherence
Scalability:
• Modular structure,
• Easily modifiable
Portability:
• Easy installation

4. Information retrieval
4.1. Data collection

The data set has been created by collecting the texts of online posts on non-clinical medical forums. The posts are mostly written by the general public and not by doctors or medical staff. Hence the topics and words used come more from day-to-day life and are less skewed towards specific medical terminology. Typically, the data collected from posts consist of symptoms, initial experiences, treatments, place of treatment, post-treatment experience, questions, side effects, and outcomes. The most informative, unstructured data is stored in the actual text of each row. On the basis of this text, the information retrieval framework proposed in the current study extracts the relevant features for clustering. Often, these non-clinical texts contain rather detailed descriptions of a disease (like cancer or COVID-19) and the specific treatment received.

4.2. Data preprocessing

To ensure that the actual text information retrieval works successfully, the collected text needs to be preprocessed and cleansed of any noise in the data. For the proposed research, we have conducted three preprocessing steps:

1. Cleansing,
2. Stemming, and
3. Tokenization

The cleansing step consists of processes to remove unwanted characters, e.g., HTML tags, emojis, and ASCII art. This is a non-trivial task when dealing with forum posts, as people express themselves quite informally. In the stemming step, inflected and derived words are reduced to their word stem [17]. Different stemming algorithms exist in the literature, e.g., the Lovins Stemmer [18], the Paice Stemmer [19], and the predominant Porter Stemmer [20]. All these stemming algorithms are best suited for English; in the present study, the Porter Stemmer is used. The Porter stemming algorithm is based on five steps, and in each step, a specified set of rules is applied to the word being processed. For instance, the first step contains the processing rules represented in figure 4. In the tokenization step, character and word sequences are sliced into tokens. Typically, the tokens are words or terms, but in this study, tokens are only words. After the tokenization, stop words are removed.

4.3. Information retrieval

To make sure that the clustering of posts into specific disease-type clusters is accurate, information from the content attributes of all the collected posts must be extracted. This is achieved by using various natural language processing [9] and text retrieval techniques, together with a predefined feature vector containing the names of a range of disease types. For the current work, we use the term weighting approach. This approach combines term frequency and inverse document frequency to yield the term frequency-inverse document frequency, which is the term's final weight. The purpose of term frequency (tf) is to measure how often a term occurs in a specific text; in this study, tf is simply an unadjusted count of term appearances.

Term frequency [21] can be defined as
tf(t,d) = number of occurrences of term t in document d.
Documents vary in length, which entails a bias in tf; that is, a term is likely to appear more often in a lengthy document than in a short document, given that the documents are similar in content [22]. Whenever a term is frequent in a document, it is likely to be relevant to that specific document. The purpose of inverse document frequency (idf) is to measure the weight of a term in a collection of documents; a rare term is often more valuable than a common term in a collection of documents [23].

Term frequency-inverse document frequency (tf-idf) is a measure of how important a word is to a specific document in a collection of documents. A significant tf-idf weight is obtained whenever: 1. the term
frequency is high for the specific document, and 2. the document frequency is low for the term across the collection of documents. Combining the tf and idf weights tends to filter out common terms that do not carry much information [24], [25].

5. Clustering
5.1. Existing DBSCAN clustering

Clustering is the process of grouping unlabeled data into clusters with homogeneous attributes. The data points in each cluster have similar traits, such that the variance within a cluster is minimal and the variance across clusters is maximal. In the proposed study, a cluster represents a homogeneous group of similar texts from posts. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a clustering algorithm based on the density of data points (also known as observations). DBSCAN creates clusters with a high density of data points, and in doing so it allows clusters of any shape even if the data contains noise, which is a slightly different approach compared to conventional clustering algorithms.

DBSCAN can find clusters of different sizes and does not require the number of clusters as input beforehand. In DBSCAN, the Ɛ-neighborhood of a point p is defined by the points within a radius Ɛ of p. If a point p's Ɛ-neighborhood contains at least mpts points, the point p is called a core point. A point p is directly density-reachable from a point q if p is within the Ɛ-neighborhood of q and q is a core point.

A point p is density-reachable from a point q with regard to Ɛ and mpts if there is a chain of points p1, …, pn, where p1 = q and pn = p, such that pi+1 is directly density-reachable from pi. A point p is density-connected to another point q with regard to Ɛ and mpts if there is a point o such that both p and q are density-reachable from o. A point p is a border point if p's Ɛ-neighborhood contains fewer than mpts points and p is directly density-reachable from a core point. A data point is called noise if it is neither a core point nor a border point. A cluster C is a non-empty set that satisfies the following two conditions for all point pairs (p, q):
1. If p is in C and q is density-reachable from p, then q is also in C; and
2. If (p, q) is in C, then p is density-connected to q.

To create a cluster, the DBSCAN algorithm starts at an arbitrary point p and searches for all the points that are density-reachable from p with respect to Ɛ and mpts. If p is a core point, then a new cluster with p as a core point is created. If p is a border point, DBSCAN moves on to the next point in the sample. DBSCAN can also merge two clusters into one if these clusters are density-reachable from each other. The algorithm terminates when no new points can be added to any existing or new cluster.

5.2. MapReduce DBSCAN clustering

The entire process of DBSCAN clustering is computationally costly, with high time and memory consumption. To reduce this consumption and increase efficiency, MapReduce DBSCAN was proposed. The only difference between regular DBSCAN clustering and DBSCAN via MapReduce is the distributed computation. The steps followed in MapReduce DBSCAN are shown in figure 4 below.

[Figure 4 flow diagram: Database, Partition, DBSCAN, Map the profile, Merging, Data mapped to cluster.]
Figure 4: MapReduce DBSCAN

5.3. Partition in MapReduce DBSCAN clustering

Runtime efficiency can be maximized through parallel processing if the data is well balanced, because the computational load can then be distributed evenly across the compute nodes. Real-life text data is usually unbalanced, and the best strategy to deal with this is data partitioning, which is an inherent part of MapReduce DBSCAN.
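The density definitions of section 5.1 (core point, border point, noise) can be made concrete with a minimal single-machine sketch, which is also roughly what each partition would run locally. It is illustrative only and not the study's implementation: the names `dbscan`, `region_query`, and `NOISE` are hypothetical, Euclidean distance is an assumption, and the brute-force neighbor search is quadratic in the number of points.

```python
from math import dist

NOISE = -1  # hypothetical marker for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, mpts):
    """Label each point with a cluster id (0, 1, ...) or NOISE.

    Illustrative O(n^2) version; Euclidean distance is an assumption.
    """
    labels = [None] * len(points)  # None = not visited yet
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < mpts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # i is a core point: start a new cluster and expand it
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned; border points are not expanded
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= mpts:  # j is also a core point
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels
```

With `eps=1.5` and `mpts=3`, two tight groups of four points each would form two clusters, and a single far-away point would be labelled `NOISE`.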
Recursive splitting is a frequently used and effective data partitioning method that splits the larger data set into smaller subsets. This is done recursively until a stop criterion is met: either all partitions contain fewer than a given number of points, or a given number of partitions has been made. Logically, a partition cannot be smaller than 2Ɛ; when a partition is split, each side of the resulting geometry must remain at least 2Ɛ wide. When splitting a partition into two in MapReduce DBSCAN, all possible splits are considered, and the split that minimizes the loss in one of the sub-partitions is chosen. Here, the loss is calculated as the difference between the number of points in sub-partition 1 and half of the number of points in sub-partition 2. Each partition is given a key and associated with a reducer.

5.4. Local DBSCAN

Each reducer is given a partition and all its associated data points; hence a mapper should prepare all data related to a partition. For example, the data assigned to partition Pi would be the data Ci within Pi, together with the data within Pi's Ɛ-width extended partition Ri that overlaps the bordering partitions.

Local DBSCAN borrows its working principles from the original DBSCAN to perform the clustering. It starts with an arbitrary data point p belonging to Ci and searches for points that are density-reachable from p with respect to Ɛ and mpts. If p is a core point, its Ɛ-neighborhood is explored for data points. If Local DBSCAN finds a point in the outer margin that is directly density-reachable from a point in the inner margin, it is added to the merge-candidate set. If a core point is in the inner margin, it is also added to the merge-candidate set. Each point in a cluster is given a local cluster ID generated from the partition ID and the label ID from the local clustering.

5.5. Mapping profile

After each partition has undergone clustering and its merge-candidate list has been generated, the merge-candidate lists are collected into a single merge-candidate list. The basics of merging the clusters from the different partitions are: 1. execute a nested loop over all points in the collected merge-candidate list to see whether the same data point exists with different local cluster IDs; 2. if so, merge the clusters.

Figure 5 illustrates two examples of cluster-merge propositions. Example 1: the point d1 belonging to C1 and the point d2 belonging to C2 are core points, and d2 is directly density-reachable from d1; thus, C1 should merge with C2. Example 2: the point d3 belonging to C1 is a core point, and the point r belonging to C2 is a border point; thus, C1 should not merge with C2.

The purpose of the mapping profile step is to create a profile that maps the clusters that should be merged. The algorithm for generating the mapping profile is given in figure 5. Its output is a list of pairs of local clusters to be merged (denoted MP) and a list of border points (denoted BP); a point p is at least a border point in a merged cluster (this is taken care of in the next step).

for each cp in CP do
    for each bp in BP do
        if cp.id == bp.id then
            MP.add((cp.local_cluster_id), (bp.local_cluster_id))
            BP.delete(bp)
        end if
    end for
end for
Figure 5: Merge mapping

5.6. Merge

The previous step resulted in a list of pairs of clusters to be merged. The IDs of the local clusters should be changed into a unique global ID after merging. Thus, a global perspective of all local clusters is built (algorithm in figure 6). Lastly, as mentioned in the previous step, noise points are set to border points.
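As a rough illustration of the merge step, the sketch below turns a list of merge pairs (the MP list) into one global ID per local cluster. It swaps in a union-find (disjoint-set) structure, a standard technique, in place of the map-slot bookkeeping of figure 6; under that assumption it should produce the same grouping. The function name `build_global_ids` and the `(partition, label)` form of the local IDs are hypothetical.

```python
def build_global_ids(local_ids, merge_pairs):
    """Map every local cluster id to a global id, merging the given pairs.

    local_ids: iterable of hashable, sortable local cluster ids,
               e.g. (partition_id, label_id) tuples (an assumed format).
    merge_pairs: iterable of (id_a, id_b) pairs that belong to one cluster;
                 both ids must also appear in local_ids.
    """
    parent = {cid: cid for cid in local_ids}

    def find(cid):
        # Follow parent links to the representative, compressing the path.
        while parent[cid] != cid:
            parent[cid] = parent[parent[cid]]
            cid = parent[cid]
        return cid

    for a, b in merge_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as representative ("lowest index wins").
            low, high = sorted((root_a, root_b))
            parent[high] = low

    # Number the representatives 0, 1, 2, ... in sorted order.
    roots = sorted({find(cid) for cid in parent})
    global_of_root = {root: gid for gid, root in enumerate(roots)}
    return {cid: global_of_root[find(cid)] for cid in parent}
```

Merges are transitive here: if cluster A merges with B and B with C, all three receive the same global ID, while clusters that appear in no pair keep an ID of their own.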
for each element pair ei, ej ∈ MP, i ≠ j, do
    if ei ∉ L ∧ ej ∉ L then
        put ei and ej into the same Map Slot in L
    end if
    if ei ∈ L ∧ ej ∉ L then
        put ej into ei's Map Slot in L
    end if
    if ei ∈ L ∧ ej ∈ L then
        if ei and ej are not in the same Map Slot in L, then move the Map Slot with the highest index to the Map Slot with the lowest index
    end if
end for
return L
Figure 6: Global ID map

6. Classification

The result of the clustering is a set of specific disease-type clusters. To enable further filtering possibilities for the end-user, a within-cluster classification is conducted such that each post within a disease-type cluster is labeled with one of the six labels illustrated in table 1. This allows an end-user to filter the forum posts such that, for instance, only posts with specific disease (cluster) treatments (class) are shown.

We have chosen to classify with a Naive Bayes classifier trained on a manually created training set augmented with the freely available set from the BioText Project, UC Berkeley [26]. The Frunza et al. study also uses a Naive Bayes classifier with promising results [9]. However, they classified abstracts from scientific articles, which is a somewhat different data domain from the present study's non-clinical texts. The time complexity of training a Naive Bayes classifier is O(np), where n is the number of training observations and p is the number of features; thus, for a fixed number of features, the complexity in terms of observations is O(n). At test time, Naive Bayes is also linear, which is optimal for a classifier.

Table 1
Results
{| class="wikitable"
! Class label !! Class description with example posts in italics
|-
| Cure || About cancer-curing treatments. ''After 16 chemo sessions, my cancer was gone.''
|-
| No cure || About cancer non-curing treatments. ''My husband went through chemo since he had bladder cancer. Sadly, he passed.''
|}

6.1. Clustering

MapReduce DBSCAN is a distributed extension of DBSCAN, and the two use the same principle for clustering. Thus, given the same input, the two clustering methods should yield the same output. The results in this section show that this is indeed the case, and we thereby consider the implementations of MR-DBSCAN and DBSCAN to be verified in terms of the correctness of the logical output. The actual implementations do not share code, so it seems fair to disregard the small risk of both implementations being wrong in a manner that leads to the same output.

For comparing the clustering results of DBSCAN and MR-DBSCAN, the Adjusted Rand Index (ARI) [30] is used. The index is a similarity measure between two clusterings,
and it is obtained by counting the pairs of points that the two clusterings treat alike (placed in the same cluster by both, or in different clusters by both) versus the pairs they treat differently, with a correction for chance agreement. If the label assignments coincide fully, the index is 1; if they agree no better than chance, the index is close to 0. If DBSCAN and MR-DBSCAN are implemented correctly, the ARI must be 1 regardless of: 1. the number of points in the data set, 2. the number of partitions in MR-DBSCAN, and 3. the parameter settings for Ɛ and mpts. In addition, the number of partitions (#P) in MapReduce DBSCAN, the coverage percentage (%C), and the number of labels (#L) in DBSCAN and MapReduce DBSCAN have been recorded. The results (table 2) show that the ARI is 1 in all 18 test cases; a necessary condition for this is that MR-DBSCAN and DBSCAN yield the same number of labels in all the tests, which is also the case (table 2).

Also, MR-DBSCAN has partitioned its data into 3-8 partitions (table 2), which means that even though the data has been split and clustered individually per partition, the merging works as intended and yields the same clustering as DBSCAN. The coverage percentage is also identical for the two clusterings in all test cases.

Table 2
Adjusted Rand Index of clustering
{| class="wikitable"
! Posts !! Ɛ !! mpts !! DBSCAN #L !! DBSCAN %C !! MapReduce DBSCAN #L !! MapReduce DBSCAN %C !! MapReduce DBSCAN #P !! ARI
|-
| 25000 || 10^-3 || 5 || 10 || 3.34 || 10 || 3.34 || 8 || 1
|-
| 25000 || 10^-3 || 50 || 2 || 2.99 || 2 || 2.99 || 8 || 1
|-
| 25000 || 10^-3 || 100 || 1 || 2.66 || 1 || 2.66 || 8 || 1
|-
| 25000 || 10^-2 || 5 || 10 || 3.34 || 10 || 3.34 || 7 || 1
|-
| 25000 || 10^-2 || 50 || 2 || 2.99 || 2 || 2.99 || 7 || 1
|-
| 25000 || 10^-2 || 100 || 1 || 2.66 || 1 || 2.66 || 7 || 1
|-
| 25000 || 10^-1 || 5 || 11 || 3.37 || 11 || 3.37 || 3 || 1
|-
| 25000 || 10^-1 || 50 || 2 || 2.99 || 2 || 2.99 || 3 || 1
|-
| 25000 || 10^-1 || 100 || 1 || 2.66 || 1 || 2.66 || 3 || 1
|-
| 35000 || 10^-3 || 5 || 23 || 2.92 || 23 || 2.92 || 7 || 1
|-
| 35000 || 10^-3 || 50 || 2 || 2.37 || 2 || 2.37 || 7 || 1
|-
| 35000 || 10^-3 || 100 || 1 || 2.02 || 1 || 2.02 || 7 || 1
|-
| 35000 || 10^-2 || 5 || 23 || 2.92 || 23 || 2.92 || 6 || 1
|-
| 35000 || 10^-2 || 50 || 2 || 2.37 || 2 || 2.37 || 6 || 1
|}

6.2. Real-time analysis of MapReduce DBSCAN

The motivation behind this experiment is to demonstrate the runtime behavior of each of the MapReduce DBSCAN steps under variations in
1. the number of forum posts, and
2. the neighborhood radius Ɛ.
These two parameters have the most significant influence on MapReduce DBSCAN's runtime. The Ɛ parameter is used when partitioning the data set, and therefore it directly influences the beneficial effects of MapReduce. In all tests, the lower point-count threshold for establishing a core point, mpts, is fixed to 5 points. This is done because the parameter has very little influence on runtime, and this influence is isolated to the DBSCAN step, i.e., it does not highlight runtime differences between DBSCAN and MapReduce DBSCAN.

For all 30 test cases (table 3), mapping takes almost no time, and merging also has only a small effect on runtime. For values of Ɛ that are large relative to the data span, i.e., 1 and 0.1, MapReduce DBSCAN cannot partition the data set well. This affects the runtime, as the clustering is then performed on a single partition (or very few), and no MapReduce improvements are achieved. For relatively small values of Ɛ, i.e., 0.001 and 0.0005, the data set is split well into partitions, but because of the low value of Ɛ there are many possible partitions, and much time is spent searching for the best partitioning. Thus, as the results show, the partitioning becomes slower when Ɛ decreases, but the local DBSCAN becomes faster. Hence, Ɛ needs to be set with care to strike a balance and minimize the total runtime of MapReduce DBSCAN. In our experiments, the balance is at Ɛ = 0.01; here, the partitioning runtime is relatively low, and likewise for the local DBSCAN, which results in a relatively low total runtime.

6.3. Validation of clustering

The purpose of this experiment is to compare time as a function of the number of forum posts for the three clustering algorithms DBSCAN, MapReduce DBSCAN,
and Hierarchical Density Estimates DBSCAN. Algorithm parameters are fixed and equal across the tests in order not to bias the results. Specifically, the lower point-count threshold for establishing a core point is mpts = 50 and the neighborhood radius is Ɛ = 0.01 for all tests. Note that the setting Ɛ = 0.01 was previously found (section 7.2) to be a suitable choice for MapReduce DBSCAN. The data sets in this experiment are various subsets of the collected forum posts; the number of tf-idf features has been limited to 1000. The results of all tests are reported in Table 3 and Figure 7.

Table 3
Results of various clustering

Posts  MapReduce DBSCAN [s]  DBSCAN [s]  Hierarchical Density Estimates DBSCAN [s]
10000  4.755                 11.969      11.696
20000  19.545                21.167      48.731
30000  31.115                50.237      105.007
40000  37.321                92.392      217.033

Figure 7: Comparison of clustering DBSCAN (blue), MapReduce DBSCAN (red), and Hierarchical Density Estimates DBSCAN (green); x-axis: Posts, y-axis: Seconds (s)

7. Discussion

The primary motivation of the proposed work is that the clinical or medical information contained in non-clinical posts collected from forums is valuable and worth making available to others in a more structured form. In the proposed study, this is achieved by a decision support system that can act as a source of information to help patients with diseases such as COVID-19 or cancer, along with their caretakers and families, learn about disease trajectories, initial symptoms, diagnosis outcomes, sources, treatment centers, treatments taken, after-effects of treatment, and costs.
The information retrieval framework operates on non-clinical forum posts and combines text retrieval, unsupervised clustering, and a classification model. The framework is designed to execute on a distributed computing set-up like MapReduce to increase computational efficiency. This considerably improves the response time of computationally costly clustering on texts, which is needed for a real-time application.
Moreover, the framework's endpoint for the end-user is a user interface that enables interaction with the database and mining for valuable information to understand the overall trajectory of a disease. This helps patients prepare before consulting a doctor. The framework may also mobilize online social communities of patients, caretakers, and families by using soft information from hitherto idle non-clinical conversations.
The proposed framework contributes to the existing literature in several ways. Adding, refining, and benchmarking more clustering and classification methods would yield more comprehensive information from non-clinical texts and might lead to better results, i.e., more accurate clustering and classification, and thus, ultimately, a better end-user service. For the classification, it would mainly be of interest to collect and use a more extensive training set. The response time of DBSCAN and Hierarchical Density Estimates DBSCAN clustering has been improved by redesigning the algorithms to guarantee upper bounds on memory consumption. This can serve as a reference for future researchers.
Lastly, the proposed system and framework are easily generalizable and can readily be applied in other domains besides COVID-19 or cancer by quickly
loading new data-sets and associated feature-vectors.
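As a supplementary illustration of the validation reported in Table 2, the following minimal Python sketch computes the adjusted Rand index (ARI) between two label assignments of the same points. It is not the implementation used in the study: the function name adjusted_rand_index, the toy label lists, and the use of -1 as a noise label are assumptions made for this example. It is meant to show why a clustering whose cluster identifiers differ, as can happen when per-partition clusters are merged, still scores an ARI of 1 as long as the grouping of points is the same.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same points."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no point pairs to disagree on
    # Count point pairs that share a cluster in both, in A only, and in B only.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)  # agreement expected by chance
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (pairs_joint - expected) / (max_index - expected)


# Hypothetical toy labels; -1 is assumed to mark noise points.
reference = [0, 0, 0, 1, 1, -1]
relabelled = [1, 1, 1, 0, 0, -1]   # same grouping, different cluster ids
different = [0, 0, 1, 1, 1, -1]    # one point moved to another cluster

print(adjusted_rand_index(reference, relabelled))
print(adjusted_rand_index(reference, different))
```

Because the index depends only on which points are grouped together, the first comparison yields 1 even though the cluster identifiers are swapped, while the second falls below 1.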