=Paper=
{{Paper
|id=Vol-1276/MedIR-SIGIR2014-07
|storemode=property
|title=Exploring Clustering Based Knowledge Discovery towards Improved Medical Diagnosis
|pdfUrl=https://ceur-ws.org/Vol-1276/MedIR-SIGIR2014-07.pdf
|volume=Vol-1276
|dblpUrl=https://dblp.org/rec/conf/sigir/PrasathO14
}}
==Exploring Clustering Based Knowledge Discovery towards Improved Medical Diagnosis==
<pdf width="1500px">https://ceur-ws.org/Vol-1276/MedIR-SIGIR2014-07.pdf</pdf>
<pre>
Exploring Clustering Based Knowledge Discovery towards
              Improved Medical Diagnosis

                        Rajendra Prasath                                         Philip O’Reilly
                 Dept. of Business Information                            Dept. of Business Information
                Systems, University College Cork                           Systems, University College
                            Ireland                                                Cork, Ireland
                        R.Prasath@ucc.ie                                    Philip.OReilly@ucc.ie

ABSTRACT                                                               medical practitioners to make more informed decisions on
We propose to develop a framework for an intelligent                   disease management.        Illustrating the role of machine
reasoner with capabilities that support complex decision               learning as an enabler of this for more informed decision
making processes in medical diagnosis. Identifying the                 support is a key aspect of this paper. In this paper, we
causes, reasoning the effects to explore information geometry          present an approach to identify causes and effects that
and learning the associated factors, from medical forum                exist across different diseases using a graph clustering based
information extracted, are the core aspects of this work. As           knowledge discovery approach.
part of the proposed framework, we present an approach                    Understanding causation and correlation between
that identifies semantically similar causes and effects for            individual factors is key to decision making in multiple
any specific disease from medical diagnosis literature using           domains including medicine and business. Traditionally,
implicit semantic interconnections among the medical terms.            such association between elements has been identified
First we crawled MedHelp1 forum data and considered two                through human interpretation of text. However, this is
types of information: forums data and posts data. Each                 a very time consuming, manual, labour intense process
forum link points to a specific disease and consists of several        and is limited by human capacity. The ability to mine
topics pertaining to that disease. Each topic consists of              textual content, identify association and the nature of
multiple posts that carry either users’ queries/difficulties           that association between elements using machine learning
or doctor’s feedback pertaining to the issue(s) of the users.          techniques provides significant opportunities. Specifically
We use graph based exploration on the terms (diseases) and             in the medical domain, where a significant amount of
their relations (in terms of causes/effects) and explore the           content is textual in nature (e.g. medical notes), having
information geometry pertaining to similar diseases. We                the ability to identify causation and correlation between
performed a systematic evaluation to identify the relevance            elements in large medical datasets provides significant
of the contextual information retrieved for a specific disease         opportunity for advancing medical research and enabling
and similar factors across different diseases. The proposed            better decision making pertaining to condition diagnosis and
approach looks promising in capturing similar causes and/or            patient management.
effects that pertain to multiple diseases. This would enable              In this paper, we attempt to identify and create
medical practitioners to have a multi-faceted view of a                term clusters using graph clustering approach and then
specific disease/condition.                                            perform topic classification. This approach improves the
                                                                       document classification task by putting the terms, that are
                                                                       semantically related, in the same cluster. Users could search
                                                                       for a specific disease and explore information pertaining to
Keywords                                                               its causes and effects by means of semantically related texts.
Causes and Effects, Medical Diagnosis, Semantically Similar
diseases, Information Geometry, Graph Analysis
                                                                       2. CLUSTERING BASED KNOWLEDGE
1. INTRODUCTION                                                           DISCOVERY
  Understanding the causes and effects pertaining to                     To incorporate natural language understanding, common-
a specific disease is key to better prediction in                      sense and domain specific knowledge could be used to
medical diagnosis and improved patient management.                     improve the text representation by including more generated
Diseases/Conditions may have similar or semantically                   informative features to perform deep understanding of the
related causes and effects. Furthermore, gaining insight               document text than the mere Bag-of-Words approach [4,
on how diseases are diagnosed and managed would enable                 2, 8].      Mitra et al. [6] proposed an unsupervised
                                                                       feature selection algorithm suitable for data sets, based on
1
    http://www.medhelp.org/forums/list                                 measuring similarity between features whereby redundancy
                                                                       therein is removed. Pedersen and Kulkarni [7] presented
                                                                       a system called SenseClusters to cluster similar contexts
                                                                       in natural language text and assigns identifying labels
Copyright is held by the author / owner(s)
MedIR 2014, July 11, 2014, Gold Coast, Australia.                      to these clusters based on their content. In addition
ACM SIGIR.                                                             to clustering similar contexts, it can be used to identify


                                                                  28
synonyms and sets of related words2 . To incorporate                       each unique term, the list of documents in which it occurs is
this kind of additional common-sense/domain knowledge,                     retrieved. Using this data, we build the weighted graph in
Gabrilovich and Markovitch [3] used world knowledge                        which the edge between two terms would represent their
from open source knowledge repository like Wikipedia to                    semantic association implicitly. This process is repeated
generate additional features. Similar intuition is adopted                 for all features and a weighted graph for the overall data
to form word clusters that generate features enriching the                 is constructed. Thus the problem is modeled into a graph
document content in a better way. Then the documents                       clustering problem. This results in a graph G = (V, E, A)
are represented in the knowledge-rich space of generated                   where |V | = n represents the number of unique terms; |E|
features. This leads to better organization of semantically                represents the number of edges and the adjacency matrix;
related text representation. In this proposed scheme, given                and A is |V | × |V | whose nonzero entries correspond to
a knowledge repository, the text documents are examined                    the edge weight between a pair of terms (adjacency list is
and their representation is enriched in a completely                       assumed in case of sparse matrix - in this case, number of
mechanical way. Motivated by the above considerations,                     rows in the graph represents the total number of terms in
our aim is to empower machine learning techniques for text                 the graph).
representation with a substantially wider body of knowledge                   We use the kernel-based multilevel clustering algorithm
like the one obtained from the superior inference capabilities             proposed by Dhillon et al. [1] on the weighted input graph
of humans.                                                                 with the number of desired partitions. This algorithm uses
                                                                           three steps: coarsening, base-clustering, and refinement. In
2.1 Mathematical Formulation                                               coarsening, the given graph is repeatedly transformed into
   In this section, we characterize the medical forum data                 smaller subgraphs. This process is repeated until a few
in a formal way. Each post pertaining to a specific topic                  nodes remain in the graph. Then during base clustering,
is informally written and we focus on terms and their co-                  regional growing approach [5] could be used with these few
occurrences. We have n textual descriptions, viz-a-viz,                    nodes. The quality of the resulting clusters depends on the
posts: P = {p1 , p2 , · · · , pn } and each post can be formally           choice of the initial nodes. The refinement process is applied
represented as a sequence of terms, illustrating the scenario              as follows: If a node in Gi is in cluster c, then all nodes
of the underlying disease, as follows: pi = {t1 , t2 , · · · , tm }        in Gi−1 formed from that node are in cluster c. For more
where 1 ≤ i ≤ m and m varies differently for different post.               details, please refer to [1, 5].
We convert the entire text data of all posts into a graph, say                In this work, we generate term clusters from extracted
G = (V, E) where V represents the set of nodes (each term                  word graphs, using co-occurrence information of terms. The
in P is considered as a node) and the co-occurrence of any                 task is to partition the graph into clusters so that terms
pair of nodes across the posts is considered to be the edge                could be grouped into a few subsets and dimensionality of
representing the strength of association between the pair of               the new term space is reduced. Now based on the generated
nodes.                                                                     term clusters, we perform classification to identify similar
                                                                           causes and effects across various diseases.
2.2 Graph Clustering
   Clustering deals with identifying a pattern/structure in                2.3    Proposed Approach
the bunch of unlabeled data. In general, clustering organizes                The proposed approach works as follows: From the posts
data into groups whose members are related in some way                     of each topic, textual descriptions are extracted. Unique
and two or more data can be grouped into the same cluster                  terms (after removing stop words) are considered as nodes
if they are, in some way, falling close to each others’ context.           in the graph and the number of times a pair of terms co-
Clustering has many useful applications like finding a group               occurs in the entire corpus is considered as the weight of the
of people with similar behavior, processing orders, grouping               edge connecting the pair of terms. At first, we build word
plants and animals, grouping web blog data to access similar               cluster vectors using graph clustering algorithm.
patterns.
   While exploring a variety of possibilities to identify the
context of causes and effects of diseases from text fragments,             Algorithm 1 Building Word Clusters
feature space grows and it is hardly possible to limit the                 Input: A set of n textual descriptions (posts)
expansion of the new feature space containing the local                                   P = {p1 , p2 , · · · , pn }
contexts extracted from the informal writing of medical                    A set of predefined category labels C = {c1 , c2 , · · · , cl }
data. This could possibly be solved by using dimensionality
reduction techniques to limit the size of the document                     Build Word Cluster Vectors:
to be classified. This attempt first makes word cluster                     1: Extract text from posts and build the unique word list
vectors using unsupervised feature generation by identifying                2: for each unique term ti in the word list do
the related contexts. Then using the identified contexts,                   3:    Identify the existence of edges from ti to all other
supervised learning is performed for categorizing the given                       terms with nonzero positive weight.
text collection consisting of user posts.                                   4:    Store the co-occurring term with its corresponding
   First, we use the entire collection of medical diagnosis                       edge weight in the adjacency list
related posts and filter out the list of the distinguishable                5: end for
unique terms. Using these terms, we first build the weighted                6: Use kernel-based multilevel graph clustering algorithm
graph in which nodes represent terms and edges represent                       on the adjacency list and perform clustering to generate
the weight - the number of documents in which the given                        cluster IDs
pair of terms co-occurs across the collection of posts. For                 7: For every cluster ID, construct word clusters
2
    word and term are used interchangibly                                   8: Store these word cluster vectors


                                                                      29
  Secondly, we use these word cluster vectors to re-represent              S.No     Class                #Diseases (Posts)
text documents that discuss the causes and effects of a                      1      Asthma                          35 (43)
specific disease. Then we perform classification of the                      2      Breast-Cancer                   35 (37)
re-represented text documents. The proposed algorithm                        3      COPD                            41 (47)
inherently applies clustering semantically related terms and                 4      Cosmetic                        57 (64)
performs classification of them. Based on the word clusters                  5      Dental                         97 (100)
found in the first step, we perform the classification on                    6      Embarazo                        41 (52)
the clustered space of label pertaining to the terms in the                  7      Genetic-Disorder                59 (68)
original documents.
                                                                             8      Hepatitis                     110 (147)
                                                                             9      Kidney                          69 (89)
Algorithm 2 Classification using Word Clusters
                                                                            10      Liver-Transplant                64 (62)
 1: Preprocess the text documents by removing numbers,                      11      Oral                            35 (47)
    punctuations and stop-words (using SMART3 word list)                    12      Pathology                       31 (36)
                                                                            13      Respiratory                     48 (54)
 2: for each processed text (post) data pi in P do                          14      Thyroid-Cancer                  58 (63)
 3:    for each unique term in pi do                                        15      Varicose-Veins                  36 (52)
 4:      Identify its cluster id
 5:      Map the given feature in terms of its cluster ID
 6:      Augment text fragments with cluster ID mappings                             Table 1: Corpus Statistics
 7:    end for
 8: end for                                                                           Correctly Classified     Wrongly Classified
 9: Build classifier on these mapped/expanded text data                Actual=Yes     True Positive (TP)      False Negative (FN)
    (containing only cluster IDs) and use it to predict the            Actual=No      False Positive (FP)     True Negative (TN)
    class of the text having similar causes and effects
                                                                                  Table 2: 2-way confusion matrix
10: Compute the classification accuracy
Output: The category label(s) for texts (post) that have
similar causes and effects across diseases.                             Recall is measured by the following equation:
                                                                                                         TP
                                                                                        Recall(R) =
                                                                                                       TP + FN
3.   EXPERIMENTAL RESULTS                                               F-Measure is computed by considering the ratio between
                                                                      Precision and Recall and hence calculated as follows:
3.1 Corpus
   We crawled a subset of MedHelp4 forum data from the                                            (β 2 + 1)P R
                                                                                           Fβ =
world wide web. The MedHelp forum is organized as a set                                            β2P + R
of topics, each representing a specific disease and under each          The balanced F1 -measure with β = 1 (when P and R are
topic, there are several subtopics. Each subtopic consists            weighted equally), is given by
of several posts from both users (may be patients or their
dependents) and doctors as well.                                                                      2pr
                                                                                              F1 =
   For our experiments, we have selected 15 categories                                               P +R
covering the most widely discussed topics in the MedHelp                Classification accuracy is measured from confusion table
forum. The details of this experimental corpus is given in            as follows:
Table. 1
   We have used 3 different types of classifiers, namely                                            TP + TN
Naive Bayes, k−Nearest Neighbours (k−NN), and Support                             Accuracy =
                                                                                               TP + FP + FN + TN
Vector Machines (SVM), to test the effect of the proposed
approach. We have used rainbow5 to build the classification           3.1.2 Discussion
models and during classification, we have used 60% of the                We have observed that the proposed approach performs
data for training and 40% for testing.                                well in classifying the medical text having causes and
3.1.1    Evaluation Methodology                                       issues expressed by either the patients or their friends or
                                                                      relatives. Figure. 1 shows the classification accuracy of
   We have used Precision, Recall, F-Measure and the                  proposed approach with top 15 classes using three different
classification Accuracy to evaluate the quality of the                types of classifiers. While using the Naive Bayes method,
identified diseases having similar causes and effects. We use         the accuracy for the class “Asthma” goes down due to
the two-way confusion matrix given in Table. 2 to derive the          the fact that certain specific causes are pertaining to
evaluation measures:                                                  the “Respiratory” related disease. Similar misclassification
   Precision is defined as follows:                                   takes place across “Dental” and “Oral” classes and these
                                      TP                              misclassified instances share common causes. Even though,
                  P recision(P ) =
                                    TP + FP                           the diseases like “Breast-cancer” and “Thyroid-cancer” share
4
  http://www.medhelp.org/forums/list/                                 a common cause for cancer, misclassification is significantly
5                                                                     less. Additionally the misclassification takes place across the
  The ‘Bow’ Toolkit - http://www.cs.cmu.edu/~mccallum/
bow/                                                                  diseases: “Hepatitis” and “Liver-Transplant”.


                                                                 30
                     Classification of Top 15 Categories: A Comparison                                       Baseline vs Proposed Methods r A Comparison
            120.00                                                                                 90.00
                                                  NaiveBayes   krNN (k=30)   SVM                                                               82.54       Baseline     Proposed
                                                                                                                        78.86
                                                                                                   80.00
                                                                                                                                                                      74.13
            100.00                                                                                         70.86
                                                                                                   70.00
                                                                                                                                 62.54
             80.00                                                                                 60.00
                                                                                                                                                       54.13
 Accuracy


                                                                                        Accuracy
                                                                                                   50.00
             60.00
                                                                                                   40.00
             40.00
                                                                                                   30.00


             20.00                                                                                 20.00

                                                                                                   10.00
              0.00
                                                                                                    0.00
                                                                                                               NaiveBayes            krNN (k=30)               SVM


Figure 1: Classification of top 15 classes: A                                           Figure 2: Comparison of the overall classification
comparison of Naive Bayes, k−NN and SVM                                                 accuracy: Baseline vs Proposed approaches
methods


   We have observed the classification accuracy of k−NN                                 Acknowledgments:        This research is co-funded by the
classifier for various values of k (=10, 20, 30, 50) and found                          Irish Government (Enterprise Ireland) and European Union
that for k =30, the system performs very well. In this                                  (European Regional Development Fund).         Dr.    Philip
classification task, we have observed that the maximum                                  O’Reilly is the Principal Investigator responsible for this
number of instances from the class “Liver-Transplant” is                                research and can be contacted at Philip.Oreilly@ucc.ie.
misclassified under “Hetatitis” class as the causes of these
common diseases coincide by and large. At the same time,                                5. REFERENCES
none of the instances is classified correctly for the class                             [1] Dhillon, I. S., Guan, Y., and Kulis, B. Weighted
“Pathology” as these causes and effects are very similarly                                  graph cuts without eigenvectors a multilevel approach.
described as that of “Hepatitis” and “Thyroid-cancer”. While                                IEEE Trans. Pattern Anal. Mach. Intell. 29, 11 (2007),
using the SVM classifier, we noticed that some instances                                    1944–1957.
are misclassified across the classes: “Breast-cancer” and                               [2] Gabrilovich, E. Feature Generation for Textual
“Cosmetic”.      In this case, user raised cross reference                                  Information Retrieval Using World Knowledge. PhD
related queries with post surgical treatment of the Breast-                                 thesis, Technion - Israel Institute of Technology, Haifa,
cancer disease. Similar misclassification is found across the                               Israel, 2006.
classes: “Chronic Obstructive Pulmonary Disease” (COPD)                                 [3] Gabrilovich, E., and Markovitch, S. Feature
and “Resporatory”. Subsequently, we would like to apply                                     generation for text categorization using world
this approach for effective retrieval of causes and effects                                 knowledge. In Proceedings of the 19th International
pertaining to a specific disease. Also we will draw the                                     Joint Conference on Artificial Intelligence (San
information geometry of prominent diseases that share the                                   Francisco, CA, USA, 2005), IJCAI’05, Morgan
common causes/effects in our subsequent experiments.                                        Kaufmann Publishers Inc., pp. 1048–1053.
                                                                                        [4] Giles, J. Internet encyclopaedias go head to head.
4. CONCLUSION                                                                               Nature 438, 1 (2005), 900–901.
                                                                                        [5] Karypis, G., and Kumar, V. A fast and high quality
  We proposed a method to enable greater understanding
                                                                                            multilevel scheme for partitioning irregular graphs.
of various conditions, their symptoms, treatment and
                                                                                            SIAM J. Sci. Comput. 20, 1 (1998), 359–392.
management by identifying similar scenarios, reasoning
the effects to explore information geometry and learning                                [6] Mitra, P., Murthy, C. A., and Pal, S. K.
the associated contextual factors, from medical forum                                       Unsupervised feature selection using feature similarity.
information extracted from health services data. This                                       IEEE Trans. Pattern Anal. Mach. Intell. 24, 3 (2002),
approach identifies semantically similar causes and effects                                 301–312.
for any specific disease/condition, using implicit semantic                             [7] Pedersen, T., and Kulkarni, A. Identifying similar
interconnections among the medical terms.           We use                                  words and contexts in natural language with
graph based exploration on the terms and their relations                                    senseclusters. In Proc. of the 20th national conf. on
(causes/effects) across the collection of posts and explore                                 Artificial intelligence (2005), AAAI’05, AAAI Press,
the information geometry pertaining to the similar diseases.                                pp. 1694–1695.
We evaluated the relevance of the contextual information                                [8] Prasath, R., and Sarkar, S. Unsupervised feature
retrieved for a specific disease and/or similar factors                                     generation using knowledge repositories for effective
across different diseases. The proposed approach looks                                      text categorization. In Proceedings of the 2010
promising in capturing similar scenarios pertaining to                                      Conference on ECAI 2010: 19th European Conference
multiple diseases. This would enable medical practitioners                                  on Artificial Intelligence (Amsterdam, The
to have a multi-faceted view about any specific disease,                                    Netherlands, The Netherlands, 2010), IOS Press,
towards better decision making.                                                             pp. 1101–1102.


                                                                                   31

</pre>