<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>February</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Labelling of Medical Forum Posts by Non-Clinical Text Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amit Kumar Kushwaha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arpan Kumar Kar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology Delhi</institution>
          ,
          <addr-line>New Delhi</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Machine Learning</institution>
          ,
          <addr-line>Chatbot, Artificial Intelligence</addr-line>
          ,
          <institution>Medical</institution>
          ,
          <addr-line>Ontology</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Semantic Intelligence Conference</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>2</volume>
      <fpage>5</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>With the advent of Web 2.0, modern societies produce a vast amount of data. Merely keeping up with its storage and transmission is difficult, and analyzing it to extract useful information is even more challenging. Historical research in healthcare data processing has concentrated on formal clinical data, yet a great deal of valuable data also lies idle in non-clinical information. The proposed study combines state-of-the-art methods from distributed computing, text retrieval, clustering, and classification into a computationally efficient system that can clarify cancer patient trajectories based on non-clinical and freely available online forum posts. The motivation is that informed patients, caretakers, and relatives often lead to better overall treatment outcomes due to enhanced possibilities of proper disease management. The resulting software prototype is fully functional and built to serve as a test bench for various text information retrieval and visualization methods. Via the prototype, we demonstrate a computationally efficient clustering of posts into cancer types and subsequent within-cluster classification into trajectory-related classes. The system also provides an interactive graphical user interface allowing end-users to mine and oversee the valuable information.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Most patients who acquire a progressive and terminal disease are towards
the latter end of life. These illnesses are primarily respiratory disorders,
cancers, and cardiovascular diseases. Such an illness implies a long time frame
for the patients themselves and for the surrounding relatives and caretakers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
The trajectory of the timeframe can be summarized as a sequence of steps shown in
figure 1. Although the steps look very simple and compact, they are complex and
contain a range of concerns underneath each of the four steps, for instance: life
expectancy at each stage, patterns of decline, probable interactions with other
health services, medicinal side effects, treatment plans and costs at each stage,
any other non-documented side effects, palliative care, and many more.
      </p>
      <p>ISIC’21: International Semantic Intelligence Conference
EMAIL: Kushwaha.amitkumar@gmail.com; Arpan_kar@yahoo.co.in</p>
      <p>2020 Copyright for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Misinformed outputs can lead to costlier yet unsuccessful and delayed
treatment.</p>
      <p>Scholarly outputs, in turn, can clarify the trajectories of the timeframe
and can lead to better overall treatment owing to better clinical sources and
decisions. This further raises the possibility of fewer re-admissions, decreased
health care costs, and a higher quality of life for patients in their potentially
final weeks, months, and years. Ultimately, better overall care is obtainable via
clarification during the early stages, estimation, and communication of
patient-specific symptoms and disease trajectories.</p>
      <p>The proposed study is motivated by the idea of exploiting the relevant yet
idle information in the ever-increasing user-generated content available as
online, freely accessible non-clinical text, for the benefit of anyone interested
in a clinical trajectory, e.g., cancer patients or COVID-19 patients. With the
recent rise in COVID-19 cases, there was massive unrest during the initial stages
of the disease spread, when even patients who had tested positive were not sure
about the trajectory of the sequence of steps in figure 1. This motivated us to
pursue this study, which can be an essential contribution to the literature for
researchers and have day-to-day practical implications for someone who has
internet access but cannot navigate through large amounts of unstructured text to
find simple, relevant clinical information.</p>
      <p>
        Historical data shows that approximately
one-third of the entire world's population gets
diagnosed with cancer during their lifetime [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
According to the World Health Organization
(WHO) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], as of 9 August 2020, 20 million patients globally had tested positive
for COVID-19. Thus, a large community of potential end-users can consume
non-clinical data to answer their queries related to clinical trajectories. A
diagnosis of cancer or, in recent times, COVID-19 leads to several reactions; a
predominant one is to first seek information online on the specific symptoms,
type, and severity, and finally on the trajectory prognosis.
      </p>
      <p>
        A trend that has recently gained prominence
among the community is to communicate on
online forums [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. On
these medical forums, people can write freely about their emotions and what they
feel about the disease, treatment, after-effects, and normalcy after the
treatment without disclosing their identity. For instance, on cancer forums,
people write freely about their initial-stage frustrations and fears, and how
they overcame them. The same applies to COVID-19 forums. No healthcare system
currently leverages this freely available, non-clinical, yet potentially very
relevant information.
      </p>
      <p>Mining all such relevant information from the wide variety, volume, and
veracity of online user-generated content on the forums lies at the overlap of
several technical and scientific research domains. It is more challenging than
mining standard health texts such as electronic health records (EHR), including
hospital admission journals that capture doctors' comments, medical reports, and
discharge summaries. In all these formal EHRs, the language of causes, symptoms,
cures, and after-effects is more concise and specific, and medical terms are used
more distinctly from case to case. These terms are very different from a
layperson's choice of terms in the same context on online forums. This adds to
our motivation to make this non-clinical data available to the general
public.</p>
      <p>The current research objective is to clarify and communicate the patient
trajectories at each stage through computationally efficient text information
retrieval from non-clinical online forum posts. In the current study, this
objective is met by building a fully functional and generalizable framework that
can screen and filter, process, and present non-clinical data on clinical
trajectories in a visual and informative way. The framework is designed to act as
a test bench for future text information retrieval methods and is not restricted
to the current study. The underlying premise of the current study is that
inherently unstructured yet valuable information is freely available on
non-clinical yet medical forums.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Scholarly research with objectives, methods, and hypotheses rooted in data
mining has mostly focused on text summarization. In 2005, Murray et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
performed clinical review research that summarizes three disease trajectories:
organ failure (heart and lung), frail elderly, and cancer. In another related
study in 2010, Ebadollahi et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] predicted a patient's
trajectory from temporal physiological data.
This study was further improved in a 2014
research undertaken by Jensen et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] with
the disease trajectory data spanning fifteen
years from a large patient population.
      </p>
      <p>
        In 2016, Ji et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] proposed a predictive model for health condition trajectories and
comorbidity relationships by training the model on social health records. Another
related study was performed in 2017 by Jensen et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] using text analysis of EHRs to predict (cancer) patient trajectories
automatically. However, summarizing all the above work, we identify a gap in text
information retrieval using distributed clustering and classification. None of
the highlighted studies has used such a framework, which can be computationally
efficient. At the end of the proposed framework, a classification model can
quickly identify patient trajectories using non-clinical texts from online
forums.
      </p>
      <p>
        Frunza et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] did a related study in 2011, in which they automatically extracted
sentences about diseases and treatments from clinical papers. Based on the
extracted sentences, semantic relations between diseases and associated
treatments were then identified. Another related study was done by Rosario et
al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] in 2004. The focus of their work was to recognize text entities
containing information about diseases and treatments. They used Hidden Markov
Models and Maximum Entropy Models to perform the entity and disease-treatment
relationship recognition.
      </p>
      <p>Compared to the present study, the work of Frunza et al. focuses mostly on
classification, whereas the present study also focuses on text retrieval and
clustering. The present study also covers both cancer and COVID-19 trajectories,
whereas the other studies have focused only on cancer as a prevalent
disease.</p>
      <p>
        Lastly, in the 2011 study by Yang et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ],
Density-Based Clustering was used to identify
topics within online forum threads on social
media. They also developed a visualization tool
to provide an overview of the identified topics.
Their tool's purpose was to extract topics with
sensitive information related to terrorism or
other criminal activities; however, it might also
be tailored to extract other topics. Besides using
DBSCAN, the study proposed a related
clustering method, namely SDC (Scalable
Density-based clustering). The structure of the Yang et al. study is, to some
extent, similar to that of the present study: here too, topics are extracted from
online forum posts, density-based clustering is used, and result visualization
capabilities are provided.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.1. Research gap addressed by the novelties of the current work</title>
      <p>The novelty of the proposed study lies in combining state-of-the-art
unsupervised, distributed text retrieval through clustering with a coherent
classification that is computationally efficient and can identify patient
trajectories based on non-clinical texts. Hence, the outcome of the proposed
study provides a unique and novel means for individuals and researchers looking
for cancer and COVID-19 trajectories. This is done by activating relevant
information hidden in non-clinical texts that has potentially been overlooked so
far by the established health care systems.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2. Significance</title>
      <p>In general, computational retrieval of
information from the vast amounts of health
care texts is significant. Specifically, for this
study, the significance lies in the systematic
combination of state-of-the-art methods to
mine, refine, categorize, and present
laypersons' cancer trajectory related
descriptions. It is significant to empower
patients and caretakers and help build healthy
patient/caretaker communities by leveraging
the soft information not hitherto used by the
established health care systems, e.g.,
information about emotions, feelings, or
personal preferences.</p>
    </sec>
    <sec id="sec-5">
      <title>3. Proposed framework</title>
    </sec>
    <sec id="sec-6">
      <title>3.1. Overview</title>
      <p>The proposed study has four major building
blocks or components, including a database
component for storing the cluster outputs. A
detailed visual representation of the framework
is given in figure 1 below. It has been designed
in a micro-service architecture with one process
per component to make the framework light
from a production implementation standpoint.</p>
      <p>[Figure 1: framework overview. Labels in the figure: Search, Posts,
Statistics, Clusters, Tools; Database; API; Information retrieval from text.]</p>
      <p>The left part of the framework in figure 1 above is the front-end
component, which handles the user interaction; it is elaborated further in
section 3.2. The API component's sole purpose is to enable the front-end
component to interact with the database and with the other service components.
The Database component persists all gathered forum posts and the computed
results, e.g., clusters, classes, and cancer trajectories. The Service component
handles the computationally burdensome data processing; the micro-service
architecture makes it possible to scale this component alone. Implementing the
service component as a scalable unit makes it well suited to a distributed
computing approach. The clustering calculations in particular are burdensome and
need to be made efficient. Currently, the text retrieval and classification
calculations do not need to be scaled, as they are much faster than the
clustering.</p>
    </sec>
    <sec id="sec-7">
      <title>3.2. Front-end</title>
      <p>Having a front end to interact with data
helps to explore results from the end-user's
perspective. The developed user interface is
useful for exploring the collected data set of
forum posts and to show information from an
area of interest. For instance: a user can select a
cluster, i.e., a disease-type, of interest, e.g.,
COVID-19 or lymphoma cancer, and only
receive posts within that cluster. A user can also
choose a pivot of information, e.g., side effects
of COVID-19 medicine or side effects of cancer
radiation, and thereby see all posts from the
cancer cluster or COVID cluster that contain information about side effects.
Such a tool is relevant both for scientific use and for patients and
caretakers.</p>
      <p>[User interface screenshot: "Need Help !!" and "Please type your query in the text box below".]</p>
      <p>The user interface consists of five main views: Search, Posts, Statistics,
Clusters, and Tools (figure 2). In the Search view, a user can search the entire
collection of forum posts and the identified clusters. Initially, a view of the
cluster types, as shown in figure 3, is displayed to the end-user. When the user
clicks a cluster type, all posts associated with that type of cluster are
displayed in the Posts view. Users can browse through the posts within a type of
cluster and filter them by selecting a class label.</p>
      <p>Figure 3: Cluster view. [Figure content: a Clusters panel and a Statistics
panel listing Kidney (160 posts), Heart (70 posts), and Liver (150 posts). The
button labels under each cluster are garbled in the source and appear to read
Side effect, Cure, Disease, No cure, and Treatment.]</p>
    </sec>
    <sec id="sec-8">
      <title>3.3. Functional validation of the outputs</title>
      <p>The robustness of a framework is usually judged on statistical metrics,
but it also needs to be measured on the intuitiveness of the text outputs as
received by the users. Hence, to measure the outputs of the framework concretely,
we assess them through the following qualitative lenses in addition to the
statistical metrics:
Functional intuitiveness:
• Appropriate,
• Suitable
Performance:
• Time,
• Utilization</p>
    </sec>
    <sec id="sec-9">
      <title>4. Information retrieval</title>
    </sec>
    <sec id="sec-10">
      <title>4.1. Data collection</title>
      <p>The data set was created by collecting the texts of online posts on
non-clinical medical forums. The posts are mostly written by laypeople rather
than doctors or medical staff. Hence, the topics and words used reflect
day-to-day language and are less skewed towards specific medical terminology.
Typically, the data collected from posts consists of symptoms, initial
experiences, treatments, places of treatment, post-treatment experiences,
questions, side effects, and outcomes. The most informative, unstructured data is
stored in the actual text of each row. On the basis of this text, the information
retrieval framework proposed in the current study extracts the relevant features
for clustering. Often, these non-clinical texts contain rather detailed
descriptions of a disease (like cancer or COVID-19) and the specific treatment
received.</p>
    </sec>
    <sec id="sec-11">
      <title>4.2. Data preprocessing</title>
      <p>To ensure that the actual text information
retrieval works successfully, the collected text
needs to be preprocessed and cleansed for any
noise in the data. For the proposed research, we
have conducted three preprocessing steps:
1. Cleansing,
2. Stemming, and
3. Tokenization</p>
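      <p>As a minimal sketch of these three steps, the following Python fragment cleanses a post, tokenizes it, removes stop words, and applies only step 1a of the Porter stemmer. The stop-word list, the function names, and the choice to tokenize before stemming (so that the stemmer sees one word at a time) are assumptions made for illustration, not the implementation used in the study.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "i", "my", "was", "of", "to", "in", "is"}


def cleanse(text):
    """Remove HTML tags and every character that is not a letter, digit, or space."""
    text = re.sub(r"<[^>]+>", " ", text)           # drop HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)    # drop emojis, ASCII art, punctuation
    return re.sub(r"\s+", " ", text).strip().lower()


def stem_step_1a(word):
    """Porter step 1a only (plural endings); the full algorithm has five steps."""
    if word.endswith("sses"):
        return word[:-2]       # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]       # ponies -> poni
    if word.endswith("ss"):
        return word            # caress -> caress
    if word.endswith("s"):
        return word[:-1]       # cats -> cat
    return word


def preprocess(text):
    """Cleanse, tokenize on whitespace, drop stop words, then stem each token."""
    tokens = cleanse(text).split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [stem_step_1a(t) for t in tokens]


if __name__ == "__main__":
    print(preprocess("<p>After 16 chemo sessions, my cancer was gone :-)</p>"))
```
]]></preformat>
      <p>Stop words are removed before stemming in this sketch so that they are matched in their unstemmed form.</p>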
      <p>
        The first step of cleansing consists of
processes to remove unwanted characters, e.g.,
HTML tags, emojis, and ASCII-artworks. This
is a non-trivial task when dealing with forum
posts as people express themselves quite
informally. In the second step, stemming, inflected and derived words are reduced
to their word stem [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Different algorithms for
stemming exist in the literature, e.g., the Lovins
Stemmer [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], the Paice Stemmer [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], and the
predominant Porter Stemmer [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. All these
stemming algorithms are best suited for
English; in the present study, the Porter
Stemmer is used. The Porter Stemming
algorithm is based on five steps, and in each
step, a specified set of rules is applied to the
word being processed. For instance, the first
step contains the following processing rules, as
represented in figure 4. In the tokenization part,
character and word sequences are sliced into
tokens. Typically, the tokens are words or
terms, but in this study, tokens are only words.
After the tokenization, stop words are removed.
      </p>
    </sec>
    <sec id="sec-12">
      <title>4.3. Information retrieval</title>
      <p>
        To ensure that the clustering of posts into disease-type clusters is
accurate, information must be extracted from the content attributes of all
collected posts. This is achieved by using various natural language processing [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
and text retrieval techniques, together with a predefined feature vector
containing the names of a range of disease types. For the current work, we use a
term weighting approach. This approach combines term frequency and inverse
document frequency to yield the term frequency-inverse document frequency, which
is the term's final weight. The purpose of term frequency (tf) is to measure how
often a term occurs in a specific document; in this study, tf is simply an
unadjusted count of term appearances.
      </p>
      <p>
        Term frequency [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] can be defined as
tf(t, d) = number of occurrences of term t in document d.
Documents vary in length, which entails a bias in tf; that is, a term is likely
to appear more often in a lengthy document than in a short document, given the
documents are similar in content [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Whenever a term is frequent in a
document, it is likely to be relevant to that
specific document. The purpose of inverse
document frequency (IDF) is to measure the
weight of a term in a collection of documents; a
rare term is often more valuable than a common
term in a collection of documents [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <p>
        Term frequency-inverse document frequency (tf-idf) is a measure of how
important a word is to a specific document in a collection of documents. A high
tf-idf weight is obtained whenever: 1. the term frequency is high for the
specific document, and 2. the document frequency is low for the term across the
collection of documents. Combining the tf and IDF weights tends to filter out
common terms that do not carry much information [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
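      <p>The weighting can be illustrated with a short Python sketch. The text above does not state the exact IDF formula, so the common form log(N / df) is assumed here, where N is the number of documents and df is the number of documents containing the term; tf is the unadjusted count. The function name is hypothetical.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # unadjusted count of term appearances
        # Assumed IDF form: log(N / df); a term present in every document gets weight 0.
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


if __name__ == "__main__":
    posts = [["chemo", "cancer", "chemo"], ["cancer", "lung"], ["cancer", "covid"]]
    for w in tf_idf(posts):
        print(w)
```
]]></preformat>
      <p>In the example, "cancer" occurs in every post and therefore receives a weight of zero, while "chemo" is frequent in one post and absent elsewhere and receives the highest weight.</p>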
    </sec>
    <sec id="sec-13">
      <title>5. Clustering</title>
    </sec>
    <sec id="sec-14">
      <title>5.1. Existing DBSCAN clustering</title>
      <p>Clustering is the process of grouping unlabeled data into clusters with
homogeneous attributes. The data points in each cluster have similar traits, such
that the within-cluster variance is minimal and the variance across clusters is
maximal. In the proposed study, a cluster represents a homogeneous group of
similar texts from posts. Density-Based Spatial Clustering of Applications with
Noise (DBSCAN) is a clustering algorithm based on the density of data points
(also known as observations). DBSCAN creates clusters from regions with a high
density of data points, and in doing so, it allows clusters of any shape even if
the data contains noise, which differs in approach from conventional clustering
algorithms.</p>
      <p>DBSCAN can find clusters of different sizes and does not require the
number of clusters as input beforehand. In DBSCAN, the Ɛ-neighborhood of a point
p is defined by the points within a radius Ɛ of p. If a point p's Ɛ-neighborhood
contains at least mpts points, the point p is called a core point. A data point
is called noise if it is neither a core point nor a border point (border points
are defined below). A point p is directly density-reachable from a point q if p
is within the Ɛ-neighborhood of q, and q is a core point.</p>
      <p>A point p is density-reachable from a point q with regard to Ɛ and mpts if
there is a chain of points p1, …, pn, where p1 = q and pn = p, such that pi+1 is
directly density-reachable from pi. A point p is density-connected to another
point q with regard to Ɛ and mpts if there is a point o such that both p and q
are density-reachable from o. A point p is a border point if p's Ɛ-neighborhood
contains fewer than mpts points and p is directly density-reachable from a core
point. A cluster C is a non-empty set that satisfies the following two conditions
for all point pairs (p, q):
1. If p is in C and q is density-reachable from p, then q is also in C; and
2. If p and q are in C, then p is density-connected to q.</p>
      <p>To create a cluster, the DBSCAN algorithm starts from an arbitrary point p
and searches for all the points that are density-reachable from p with respect to
Ɛ and mpts. If p is a core point, a new cluster with p as a core point is
created. If p is a border point, DBSCAN moves on to the next point in the sample.
DBSCAN can also merge any two clusters into one if these clusters are within the
same density range. The algorithm terminates when no new points can be added to
any existing or new clusters.</p>
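      <p>The procedure can be sketched in a few lines of Python. This is a minimal, single-machine illustration that assumes points are numeric tuples compared by Euclidean distance; the study clusters text feature vectors, and the names eps, min_pts, and region_query are hypothetical stand-ins for Ɛ, mpts, and the neighborhood search.</p>
      <preformat><![CDATA[
```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks noise points."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE           # may later be relabelled as a border point
            continue
        cluster += 1                    # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster     # border point reached from a core point
            if labels[j] is not None:
                continue                # already assigned to a cluster
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:   # j is also a core point: keep expanding
                queue.extend(j_neighbours)
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
    print(dbscan(pts, eps=1.5, min_pts=3))
```
]]></preformat>
      <p>The neighborhood search is a plain linear scan, which keeps the sketch short but makes it quadratic in the number of points.</p>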
    </sec>
    <sec id="sec-15">
      <title>5.2. MapReduce DBSCAN clustering</title>
    </sec>
    <sec id="sec-16">
      <p>The entire process of DBSCAN clustering is computationally costly, with
high time and memory consumption. To reduce this consumption and increase
efficiency, MapReduce DBSCAN was proposed. The only difference between regular
DBSCAN clustering and DBSCAN via MapReduce is that the computation is
distributed. The steps followed in MapReduce DBSCAN are shown in figure 4
below.</p>
      <p>[Figure 4: steps of MapReduce DBSCAN. Labels in the figure: Database,
Partition, DBSCAN, Data mapped to cluster, Merging, Map the profile.]</p>
    </sec>
    <sec id="sec-17">
    </sec>
    <sec id="sec-18">
      <title>5.3. Partition in MapReduce DBSCAN clustering</title>
    </sec>
    <sec id="sec-19">
      <p>Parallel processing maximizes runtime efficiency only if the data is well
balanced: in that case, the computational load can be evenly distributed across
the computing nodes. Real-life text data is usually unbalanced, and the best
strategy to deal with this is data partitioning, which is an inherent part of
MapReduce DBSCAN.</p>
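      <p>Data partitioning of this kind is commonly done by splitting the data space recursively. The following Python fragment is a minimal sketch under several assumptions that are not taken from the study: points are two-dimensional, partitions are axis-aligned boxes given as (x0, y0, x1, y1), candidate cuts are tried at multiples of 2Ɛ, and the loss of a cut is the distance between one side's point count and half of the points in the box being split. All function and parameter names are hypothetical.</p>
      <preformat><![CDATA[
```python
def count_in(points, box):
    """Number of points inside the half-open box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return sum(1 for x, y in points if x0 <= x < x1 and y0 <= y < y1)


def candidate_splits(box, min_size):
    """All axis-aligned cuts that leave both halves at least min_size wide."""
    x0, y0, x1, y1 = box
    cuts = []
    x = x0 + min_size
    while x <= x1 - min_size:
        cuts.append(((x0, y0, x, y1), (x, y0, x1, y1)))
        x += min_size
    y = y0 + min_size
    while y <= y1 - min_size:
        cuts.append(((x0, y0, x1, y), (x0, y, x1, y1)))
        y += min_size
    return cuts


def recursive_split(points, box, eps, max_points):
    """Split box until a part holds fewer than max_points or cannot be cut further."""
    n = count_in(points, box)
    cuts = candidate_splits(box, 2 * eps)   # no part may be narrower than 2 * eps
    if n < max_points or not cuts:
        return [box]
    # Assumed loss: how far the first half is from holding exactly half of the points.
    first, second = min(cuts, key=lambda c: abs(count_in(points, c[0]) - n / 2))
    return (recursive_split(points, first, eps, max_points)
            + recursive_split(points, second, eps, max_points))


if __name__ == "__main__":
    pts = [(1, 1), (2, 2), (3, 1), (9, 9), (12, 3), (15, 15)]
    print(recursive_split(pts, (0, 0, 16, 16), eps=2, max_points=4))
```
]]></preformat>
      <p>Each box returned by the sketch would then be given a key and handed to its own reducer.</p>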
      <p>Recursive splitting is an effective and frequently used data partitioning
method, which splits the entire data set into smaller subsets. This is done
recursively until a stop criterion is met: either all partitions contain fewer
than a given number of points, or a given number of partitions has been made.
Logically, a partition cannot be smaller than 2Ɛ; when a partition is split, each
resulting sub-partition must still extend at least 2Ɛ. When splitting a partition
into two in MapReduce DBSCAN, all possible splits are considered, and the split
that minimizes the loss in one of the sub-partitions is chosen. Here, the loss is
calculated as the difference between the number of points in that sub-partition
and half of the number of points in the partition being split. Each partition is
given a key and associated with a reducer.</p>
    </sec>
    <sec id="sec-20">
      <title>5.4. Local DBSCAN</title>
      <p>As stated above, each partition is associated with a reducer. Each reducer
is given a partition and all its associated data points; hence, a mapper should
prepare all data related to a partition. For example, the data assigned to the
reducer of partition Pi would be the data Ci within Pi, together with the data
within Pi's Ɛ-width extended partition Ri that overlaps the bordering
partitions.</p>
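      <p>The mapper's job can be sketched as follows, assuming two-dimensional points and partitions stored as axis-aligned boxes (x0, y0, x1, y1) under a partition key. The names are hypothetical; the point of the sketch is only that a point lying within Ɛ of a partition border is emitted to more than one reducer.</p>
      <preformat><![CDATA[
```python
def in_box(point, box):
    """True if point lies inside the half-open box (x0, y0, x1, y1)."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x < x1 and y0 <= y < y1


def expand(box, eps):
    """Grow a box by eps on every side to obtain the extended partition."""
    x0, y0, x1, y1 = box
    return (x0 - eps, y0 - eps, x1 + eps, y1 + eps)


def map_points(points, partitions, eps):
    """Group points by partition key; points near a border go to several reducers."""
    grouped = {key: [] for key in partitions}
    for point in points:
        for key, box in partitions.items():
            if in_box(point, expand(box, eps)):
                grouped[key].append(point)
    return grouped


if __name__ == "__main__":
    parts = {"P1": (0, 0, 4, 16), "P2": (4, 0, 16, 16)}
    print(map_points([(1, 1), (3, 5), (5, 5), (10, 10)], parts, eps=2))
```
]]></preformat>
      <p>In the example, the points (3, 5) and (5, 5) lie within Ɛ of the shared border and are therefore sent to both reducers.</p>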
      <p>Local DBSCAN borrows its working principles from the original DBSCAN to
perform the clustering. It starts with an arbitrary data point p belonging to Ci
and searches for points that are density-reachable from p with respect to Ɛ and
mpts. If p is a core point, its Ɛ-neighborhood is explored for data points. If
Local DBSCAN finds a point in the outer margin that is directly density-reachable
from a point in the inner margin, it is added to the merge-candidate set. If a
core point is in the inner margin, it is also added to the merge-candidate set.
Each point in a cluster is given a local cluster ID generated from the partition
ID and the label ID from the local clustering.</p>
    </sec>
    <sec id="sec-21">
      <title>5.5. Mapping profile</title>
      <p>After each partition has undergone clustering and the merge-candidate
lists have been generated, these lists are collected into a single
merge-candidate list. The basics of merging the clusters from the different
partitions are: 1. Execute a nested loop over all points in the collected
merge-candidate list to see whether the same data point exists with different
local cluster IDs; 2. If so, merge the clusters.</p>
      <p>Figure 5 illustrates two examples of cluster-merge propositions. Example
1: the points d1, belonging to C1, and d2, belonging to C2, are core points, and
d2 is directly density-reachable from d1; thus, C1 should merge with C2. Example
2: the point d3, belonging to C1, is a core point, and r, belonging to C2, is a
border point; thus, C1 should not merge with C2.</p>
      <p>The purpose of the Mapping Profile step is to create a profile that maps
clusters that should be merged. The algorithm for generating the mapping profile
is represented in the algorithm in figure 5. The output of the algorithm is a
list of pairs of local clusters to be merged (denoted MP) and a list of border
points (denoted BP); a point p is at least a border point in a merged cluster
(this is taken care of in the next step).</p>
      <preformat>for each cp in CP do
    for each bp in BP do
        if cp.id == bp.id then
            MP.add((cp.local_cluster_id, bp.local_cluster_id))
            BP.delete(bp)
        end if
    end for
end for</preformat>
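      <p>A runnable Python rendering of this pseudocode might look as follows. It assumes that CP holds the core points and BP the border points from the collected merge-candidate list, and that each candidate carries a point id and a local cluster id; the Candidate record and the function name are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import namedtuple

# Hypothetical record: a point id plus the local cluster id a reducer gave it.
Candidate = namedtuple("Candidate", ["id", "local_cluster_id"])


def mapping_profile(core_points, border_points):
    """Return (MP, BP): cluster pairs to merge and the border points left unmatched."""
    mp = []
    remaining = list(border_points)
    for cp in core_points:
        still_unmatched = []
        for bp in remaining:
            if cp.id == bp.id:
                # Same point seen in two partitions: their local clusters must merge.
                mp.append((cp.local_cluster_id, bp.local_cluster_id))
            else:
                still_unmatched.append(bp)
        # Same effect as BP.delete(bp), without mutating a list while iterating it.
        remaining = still_unmatched
    return mp, remaining


if __name__ == "__main__":
    cp = [Candidate(7, "P1-0"), Candidate(9, "P1-1")]
    bp = [Candidate(7, "P2-0"), Candidate(8, "P2-0")]
    print(mapping_profile(cp, bp))
```
]]></preformat>
      <p>Rebuilding the list of unmatched border points on each pass avoids deleting from a list while it is being iterated, which the pseudocode leaves implicit.</p>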
      <p>The previous step resulted in a list of pairs
of clusters to be merged. The IDs of the local
clusters should be changed into a unique global
ID after merging. Thus, a global perspective of
all local clusters is built (algorithm in figure 6).
Lastly, as mentioned in the previous step, noise
points are set to border points.</p>
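      <p>The relabelling algorithm of figure 6 is not reproduced in the text, so the following Python fragment is only one possible way to turn merge pairs into global IDs. It uses a union-find structure, which is a choice made for this sketch and not necessarily the authors' method; the function name and the ID format are hypothetical.</p>
      <preformat><![CDATA[
```python
def global_ids(local_ids, merge_pairs):
    """Map every local cluster id to a global id; merged clusters share one id."""
    parent = {cid: cid for cid in local_ids}

    def find(cid):
        # Follow parent links to the representative, shortening the path on the way.
        while parent[cid] != cid:
            parent[cid] = parent[parent[cid]]
            cid = parent[cid]
        return cid

    for a, b in merge_pairs:
        parent[find(a)] = find(b)

    roots = {}
    result = {}
    for cid in local_ids:
        root = find(cid)
        if root not in roots:
            roots[root] = len(roots)   # next unused global id
        result[cid] = roots[root]
    return result


if __name__ == "__main__":
    ids = ["P1-0", "P1-1", "P2-0", "P3-0"]
    print(global_ids(ids, [("P1-0", "P2-0"), ("P2-0", "P3-0")]))
```
]]></preformat>
      <p>Chains of merges are handled transitively: if cluster A merges with B and B with C, all three receive the same global ID.</p>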
    </sec>
    <sec id="sec-22">
      <title>6. Classification</title>
      <p>The result of the clustering is a set of disease-type clusters. To enable
further filtering possibilities for the end-user, a within-cluster classification
is conducted such that each post within a disease-type cluster is labeled with
one of the six labels illustrated in table 1. This allows an end-user to filter
the forum posts such that, for instance, only posts about treatments (class) for
a specific disease (cluster) are shown.</p>
      <p>
        We have chosen to classify with a Naive
Bayes classifier trained with a manually created
training set augmented with the freely available
set from the BioText Project, UC, Berkeley
[
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. The Frunza et al. study also uses a Naive
Bayes classifier with promising results [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
However, they classified abstracts from scientific articles, which is a somewhat
different data domain than the present study's non-clinical texts. The time
complexity of training a Naive Bayes classifier is O(np), where n is the number
of training observations and p is the number of features; thus, treating p as a
constant, the complexity in terms of observations is O(n). At test time, Naive
Bayes is also linear, which is optimal for a classifier.
      </p>
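      <p>To make the linear training cost concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written with the Python standard library only. It is an illustration under stated assumptions, not the classifier used in the prototype: the function names are hypothetical, the posts are assumed to be tokenized already, and the four toy training posts are invented for the example (two of them paraphrase the example posts of table 1).</p>
      <preformat>
```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Train a multinomial Naive Bayes model in one pass over all tokens."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        # Each training token is touched once, hence the linear training cost.
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, token_counts, vocabulary


def predict(model, tokens):
    """Return the label with the highest log-posterior for one tokenized post."""
    class_counts, token_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_tokens = sum(token_counts[label].values())
        score = math.log(n_docs / total_docs)
        for token in tokens:
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = token_counts[label][token] + 1
            score += math.log(count / (total_tokens + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration only.
posts = [
    "after 16 chemo sessions my cancer was gone".split(),
    "the tumour is gone after surgery".split(),
    "my husband went through chemo sadly he passed".split(),
    "she passed away after the treatment failed".split(),
]
labels = ["Cure", "Cure", "No cure", "No cure"]
model = train_naive_bayes(posts, labels)
print(predict(model, "cancer gone after chemo".split()))  # Cure
```
      </preformat>
      <p>Working in log space keeps the product of many small token probabilities numerically stable, which matters for longer forum posts.</p>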
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr>
              <th>Class label</th>
              <th>Class description with example posts in italics</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Cure</td>
              <td>About cancer-curing treatments. <italic>After 16 chemo sessions, my cancer was gone.</italic></td>
            </tr>
            <tr>
              <td>No cure</td>
              <td>About cancer non-curing treatments. <italic>My husband went through chemo since he had bladder cancer. Sadly, he passed.</italic></td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-23">
      <title>6.1. Clustering</title>
      <p>MapReduce DBSCAN (MR-DBSCAN) is a distributed
extension of DBSCAN, and both use the same
principle for clustering. Thus, given the same
input, the two clustering methods should yield
the same output. The results in this section
show that this is indeed the case, and we therefore
consider the implementations of MR-DBSCAN
and DBSCAN to be verified in terms of the
correctness of the logical output. The two
implementations do not share code, so it seems
fair to disregard the small risk that both
implementations are wrong in a manner that leads to
the same output.</p>
      <p>To compare the clustering results of
DBSCAN and MR-DBSCAN, the Adjusted
Rand Index (ARI) [30] is used. The index is a
similarity measure between two clusterings;
it is obtained by counting the pairs of points
that are assigned to the same cluster in both clusterings versus
the pairs that are assigned to the same cluster in one clustering but to
different clusters in the other, corrected for chance agreement. If the label assignments
coincide fully, the index is 1; if they
agree no more than expected by chance, the index is close to 0. If DBSCAN and
MR-DBSCAN are implemented correctly, the
ARI must be 1 regardless of (1) the number
of points in the data set, (2) the number of
partitions in MR-DBSCAN, and (3) the
parameter settings for Ɛ and mpts. In addition, the
number of partitions (#P) in MapReduce
DBSCAN, the coverage percentage (%C), and
the number of labels (#L) in DBSCAN and
MapReduce DBSCAN have been recorded.
The results (table 2) show that the ARI is 1 in
all 18 test cases; a necessary condition for this
is that MR-DBSCAN and DBSCAN yield
the same number of labels in every test, which is also
the case (table 2).</p>
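      <p>For reference, the ARI can be computed from pair counts with a few lines of standard-library Python. The sketch below is an illustration of the index, not the evaluation code used in the experiments; the function name and the toy label lists are assumptions. Note that it treats the noise label (here -1) as just another cluster.</p>
      <preformat>
```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index of two label sequences over the same points."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same points")
    n = len(labels_a)
    # Pairs of points that share a cluster in both clusterings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of points that share a cluster in each clustering on its own.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2) if n > 1 else 0.0
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case: both clusterings are trivial and identical.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# The cluster ids differ, but the grouping of points is the same.
dbscan_labels = [0, 0, 1, 1, -1, 2]
mr_dbscan_labels = [5, 5, 3, 3, -1, 9]
print(adjusted_rand_index(dbscan_labels, mr_dbscan_labels))  # 1.0
```
      </preformat>
      <p>Because only the grouping of points matters, the index is unaffected by how the two implementations number their clusters, which is why it suits a comparison of DBSCAN with MR-DBSCAN after the global relabeling.</p>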
      <p>Moreover, MR-DBSCAN partitioned
its data into 3 to 8 partitions (table 2), which
means that even though the data was split
and clustered individually per partition, the
merging works as intended and yields the same
clustering as DBSCAN. The coverage
percentage is also identical for the two
clusterings in all test cases.</p>
    </sec>
    <sec id="sec-25">
      <title>6.2. Real-time analysis of MapReduce DBSCAN</title>
      <p>This experiment examines the runtime
of each of the MapReduce DBSCAN steps, as needed for a real-time application, under
variations in
1. the number of forum posts, and
2. the neighborhood radius Ɛ.</p>
      <p>These two parameters have the most
significant influence on the runtime of MapReduce
DBSCAN. The Ɛ parameter is used
when partitioning the data set, and it therefore
directly influences the beneficial effects of
MapReduce. In all tests, the lower point-count
threshold for establishing a core point, mpts, is
fixed to 5 points. This is done because the parameter
has very little influence on runtime, and this
influence is isolated to the DBSCAN step, i.e.,
it does not highlight runtime differences
between DBSCAN and MapReduce DBSCAN.</p>
      <p>For all 30 test cases (table 3), mapping takes
almost no time, and merging also has only a small
effect on runtime. For values of Ɛ that are large
relative to the data span, i.e., 1 and 0.1,
MapReduce DBSCAN cannot partition the data
set well. This affects the runtime, as the
clustering is then performed on a single
partition (or very few), and no MapReduce
improvements are achieved. For relatively
small values of Ɛ, i.e., 0.001 and 0.0005, the
data set is split well into partitions, but because of
the low value of Ɛ there are many possible
partitions, and much time is spent searching for
the best partitioning. Thus, as the results show,
the partitioning becomes slower when Ɛ
decreases, but the local DBSCAN becomes
faster. Hence, Ɛ needs to be set with care to
strike a balance and minimize the total runtime
of MapReduce DBSCAN. In our experiments,
the balance is found at Ɛ = 0.01, where the runtimes
of both the partitioning and the
local DBSCAN are relatively low, which results in a relatively low
total runtime.</p>
    </sec>
    <sec id="sec-26">
      <title>6.3. Validation of clustering</title>
      <p>The purpose of this experiment is to
compare the runtime, as a function of the number of
forum posts, of three clustering
algorithms: DBSCAN, MapReduce DBSCAN,
and Hierarchical Density Estimates DBSCAN.
Algorithm parameters are fixed and equal
across the tests in order not to bias the results.
Specifically, the lower point-count threshold
for establishing a core point is mpts = 50 and the
neighborhood radius is Ɛ = 0.01 for all tests. Note
that the setting Ɛ = 0.01 was previously found
(section 6.2) to be a suitable choice for
MapReduce DBSCAN. The data sets in this
experiment are various subsets of the collected
forum posts; the number of tf-idf features has
been limited to 1000. The results of all tests are
reported in table 3 and figure 7.</p>
    </sec>
    <sec id="sec-27">
      <title>7. Discussion</title>
      <p>The primary motivation of the proposed
work is that the
clinical or medical information contained in non-clinical
posts collected from forums is valuable and
worth making available to others in a more
structured form. In the proposed study, this is
achieved by a decision support system that can
act as a source of information to help
patients with diseases such as COVID-19 or cancer, and
their caretakers and families, learn about
disease trajectories, initial symptoms,
diagnosis outcomes, sources, treatment centers,
treatments taken, after-effects of treatment, and
costs.</p>
      <p>The information retrieval framework processes
the non-clinical forum posts
using text retrieval, unsupervised clustering,
and a classification model. The framework is
designed to execute on a distributed computing
set-up like MapReduce to increase
computational efficiency. This considerably improves the response time of
a computationally costly clustering of texts,
as needed for a real-time
application.</p>
      <p>Moreover, the endpoint of the current
framework is a user interface
that enables the end-user to interact with the
database and mine for valuable information to
understand the overall trajectory of a disease.
This helps patients prepare
before consulting a doctor.
The framework may also mobilize online social
communities of patients, caretakers, and
families by drawing on soft information and
non-clinical conversations that have hitherto gone unused.</p>
      <p>The proposed framework
contributes to the existing
literature in several ways. Adding,
refining, and benchmarking more clustering
and classification methods would yield more
comprehensive information from
non-clinical texts and might lead to better results,
i.e., more accurate clustering and
classifications, and thus, ultimately, a better
end-user service. For the classification, it would
mainly be of interest to collect and use a more
extensive training set. The response time of
DBSCAN and Hierarchical Density Estimates
DBSCAN clustering has been improved by
redesigning the algorithms to guarantee upper
bounds on memory consumption. This can serve
as a reference in the literature for future
researchers.</p>
      <p>In conclusion, the proposed system
and framework are easily generalizable: they
can readily be applied in domains
other than COVID-19 or cancer by
loading new data sets and their associated
feature vectors.</p>
    </sec>
    <sec id="sec-28">
      <title>8. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Jensen</surname>
          </string-name>
          et al.,
          <article-title>"Analysis of free text in electronic health records for identification of cancer patient trajectories,"</article-title>
          <source>Scientific Reports</source>
          , vol.
          <volume>7</volume>
          , no.
          <issue>1</issue>
          , art. no. 1, Apr.
          <year>2017</year>
          , doi: 10.1038/srep46226.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kendall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Boyd</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheikh</surname>
          </string-name>
          ,
          <article-title>"Illness trajectories and palliative care,"</article-title>
          <source>BMJ</source>
          , vol.
          <volume>330</volume>
          , no.
          <issue>7498</issue>
          , pp.
          <fpage>1007</fpage>
          -
          <lpage>1011</lpage>
          , Apr.
          <year>2005</year>
          , doi: 10.1136/bmj.330.7498.1007.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <article-title>"The Danish Cancer Society,"</article-title>
          <source>International</source>
          . https://www.cancer.dk/international/about-the-danish-cancer-society/ (accessed 09th August,
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <article-title>"WHO Coronavirus Disease (COVID-19) Dashboard."</article-title>
          https://covid19.who.int (accessed 09th August,
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Umefjord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hamberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Malker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Petersson</surname>
          </string-name>
          ,
          <article-title>"The use of an Internet-based Ask the Doctor Service involving family physicians: evaluation by a web survey,"</article-title>
          <source>Fam Pract</source>
          , vol.
          <volume>23</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>159</fpage>
          -
          <lpage>166</lpage>
          , Apr.
          <year>2006</year>
          , doi: 10.1093/fampra/cmi117.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Umefjord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sandström</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Malker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Petersson</surname>
          </string-name>
          ,
          <article-title>"Medical text-based consultations on the Internet: A 4-year study,"</article-title>
          <source>International Journal of Medical Informatics</source>
          , vol.
          <volume>77</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>114</fpage>
          -
          <lpage>121</lpage>
          , Feb.
          <year>2008</year>
          , doi: 10.1016/j.ijmedinf.2007.01.009.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Kushwaha</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Kar</surname>
          </string-name>
          ,
          <article-title>"Language Model-Driven Chatbot for Business to Address Marketing and Selection of Products,"</article-title>
          in
          <source>Re-imagining Diffusion and Adoption of Information Technology and Systems: A Continuing Conversation</source>
          , Cham,
          <year>2020</year>
          , pp.
          <fpage>16</fpage>
          -
          <lpage>28</lpage>
          , doi: 10.1007/978-3-030-64849-7_3.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Kushwaha</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Kar</surname>
          </string-name>
          ,
          <article-title>"Microfoundations of Artificial Intelligence Adoption in Business: Making the Shift,"</article-title>
          in
          <source>Re-imagining Diffusion and Adoption of Information Technology and Systems: A Continuing Conversation</source>
          , Cham,
          <year>2020</year>
          , pp.
          <fpage>249</fpage>
          -
          <lpage>260</lpage>
          , doi: 10.1007/978-3-030-64849-7_22.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Kushwaha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Kar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. Vigneswara</given-names>
            <surname>Ilavarasan</surname>
          </string-name>
          ,
          <article-title>"Predicting Information Diffusion on Twitter a Deep Learning Neural Network Model Using Custom Weighted Word Features,"</article-title>
          in
          <source>Responsible Design, Implementation and Use of Information and Communication Technology</source>
          , Cham,
          <year>2020</year>
          , pp.
          <fpage>456</fpage>
          -
          <lpage>468</lpage>
          , doi: 10.1007/978-3-030-44999-5_38.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Kushwaha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mandal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pharswan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Kar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. V.</given-names>
            <surname>Ilavarasan</surname>
          </string-name>
          ,
          <article-title>"Studying Online Political Behaviours as Rituals: A Study of Social Media Behaviour Regarding the CAA,"</article-title>
          in
          <source>Re-imagining Diffusion and Adoption of Information Technology and Systems: A Continuing Conversation</source>
          , Cham,
          <year>2020</year>
          , pp.
          <fpage>315</fpage>
          -
          <lpage>326</lpage>
          , doi: 10.1007/978-3-030-64861-9_28.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ebadollahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gotz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sow</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Neti</surname>
          </string-name>
          ,
          <article-title>"Predicting Patient's Trajectory of Physiological Data using Temporal Trends in Similar Patients: A System for Near-Term Prognostics,"</article-title>
          <source>AMIA Annual Symposium</source>
          , vol.
          <volume>2010</volume>
          , pp.
          <fpage>192</fpage>
          -
          <lpage>196</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <article-title>"Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients,"</article-title>
          <source>Nature Communications</source>
          . https://www.nature.com/articles/ncomms5022 (accessed 09th August,
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Chun</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Geller</surname>
          </string-name>
          ,
          <article-title>"Predicting Comorbid Conditions and Trajectories Using Social Health Records,"</article-title>
          <source>IEEE Transactions on NanoBioscience</source>
          , vol.
          <volume>15</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>371</fpage>
          -
          <lpage>379</lpage>
          , Jun.
          <year>2016</year>
          , doi: 10.1109/TNB.2016.2564299.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>O.</given-names>
            <surname>Frunza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Inkpen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <article-title>"A Machine Learning Approach for Identifying Disease-Treatment Relations in Short Texts,"</article-title>
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          , vol.
          <volume>23</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>801</fpage>
          -
          <lpage>814</lpage>
          , Jun.
          <year>2011</year>
          , doi: 10.1109/TKDE.2010.152.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lousteau-Cazalet</surname>
          </string-name>
          et al.,
          <article-title>"A decision support system for eco-efficient biorefinery process comparison using a semantic approach,"</article-title>
          <source>Computers and Electronics in Agriculture</source>
          , vol.
          <volume>127</volume>
          , pp.
          <fpage>351</fpage>
          -
          <lpage>367</lpage>
          , Sep.
          <year>2016</year>
          , doi: 10.1016/j.compag.2016.06.020.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Yang</surname>
          </string-name>
          and
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <article-title>"Analyzing and Visualizing Web Opinion Development and Social Interactions With Density-Based Clustering,"</article-title>
          <source>IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans</source>
          , vol.
          <volume>41</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>1144</fpage>
          -
          <lpage>1155</lpage>
          , Nov.
          <year>2011</year>
          , doi: 10.1109/TSMCA.2011.2113334.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Schuetze</surname>
          </string-name>
          , "Introduction to Information Retrieval," p.
          <fpage>581</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <article-title>An algorithm for suffix stripping</article-title>
          .
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Goldsmith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Higgins</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Soglasnova</surname>
          </string-name>
          ,
          <article-title>"Automatic Language-Specific Stemming in Information Retrieval,"</article-title>
          in
          <source>Cross-Language Information Retrieval and Evaluation</source>
          , Berlin, Heidelberg,
          <year>2001</year>
          , pp.
          <fpage>273</fpage>
          -
          <lpage>283</lpage>
          , doi: 10.1007/3-540-44645-1_27.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Porter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Lynch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Herrig</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Ziebol</surname>
          </string-name>
          ,
          <article-title>"(54) DEVICE AND METHOD FOR VASCULAR ACCESS,"</article-title>
          p.
          <fpage>60</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Spärck Jones</surname>
          </string-name>
          ,
          <article-title>"Simple, proven approaches to text retrieval,"</article-title>
          University of Cambridge, Computer Laboratory, UCAM-CL-TR-356,
          <year>1994</year>
          . Accessed: 10th August, 2020. [Online]. Available: https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-356.html.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          and
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <article-title>"Relevance weighting of search terms,"</article-title>
          <source>Journal of the American Society for Information Science</source>
          , vol.
          <volume>27</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>146</lpage>
          ,
          <year>1976</year>
          , doi: 10.1002/asi.4630270302.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <article-title>"Understanding inverse document frequency: on theoretical arguments for IDF,"</article-title>
          <source>Journal of Documentation</source>
          , vol.
          <volume>60</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>503</fpage>
          -
          <lpage>520</lpage>
          , Jan.
          <year>2004</year>
          , doi: 10.1108/00220410410560582.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kar</surname>
          </string-name>
          ,
          <article-title>"Applications of Machine Learning in Business,"</article-title>
          <source>Business Frontiers</source>
          , 24th July,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kar</surname>
          </string-name>
          ,
          <article-title>"Understanding Machine Learning and Artificial Intelligence and their effects on Financial Systems,"</article-title>
          <source>Business Fundas</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <article-title>"Classification of Diseases and their Treatments Using Machine Learning Approach,"</article-title>
          <source>ProQuest</source>
          . https://search.proquest.com/openview/423cca63369eb17808ce3e845e51b852/1?cbl=2029261&amp;pq-origsite=gscholar (accessed 12th August,
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>