=Paper=
{{Paper
|id=Vol-1276/MedIR-SIGIR2014-07
|storemode=property
|title=Exploring Clustering Based Knowledge Discovery towards Improved Medical Diagnosis
|pdfUrl=https://ceur-ws.org/Vol-1276/MedIR-SIGIR2014-07.pdf
|volume=Vol-1276
|dblpUrl=https://dblp.org/rec/conf/sigir/PrasathO14
}}
==Exploring Clustering Based Knowledge Discovery towards Improved Medical Diagnosis==
Rajendra Prasath and Philip O'Reilly
Dept. of Business Information Systems, University College Cork, Ireland
R.Prasath@ucc.ie, Philip.OReilly@ucc.ie

Copyright is held by the author/owner(s). MedIR 2014, July 11, 2014, Gold Coast, Australia. ACM SIGIR.

===Abstract===
We propose to develop a framework for an intelligent reasoner with capabilities that support complex decision-making processes in medical diagnosis. Identifying causes, reasoning about effects to explore the information geometry, and learning the associated factors from extracted medical forum information are the core aspects of this work. As part of the proposed framework, we present an approach that identifies semantically similar causes and effects for any specific disease from the medical diagnosis literature, using implicit semantic interconnections among the medical terms. First, we crawled MedHelp (http://www.medhelp.org/forums/list) forum data and considered two types of information: forum data and post data. Each forum link points to a specific disease and consists of several topics pertaining to that disease. Each topic consists of multiple posts that carry either users' queries and difficulties or doctors' feedback on the users' issues. We use graph-based exploration of the terms (diseases) and their relations (causes/effects) and explore the information geometry pertaining to similar diseases. We performed a systematic evaluation of the relevance of the contextual information retrieved for a specific disease and of similar factors across different diseases. The proposed approach looks promising in capturing similar causes and/or effects that pertain to multiple diseases. This would enable medical practitioners to have a multi-faceted view of a specific disease or condition.

'''Keywords:''' Causes and Effects, Medical Diagnosis, Semantically Similar Diseases, Information Geometry, Graph Analysis

===1. Introduction===
Understanding the causes and effects pertaining to a specific disease is key to better prediction in medical diagnosis and improved patient management. Diseases and conditions may have similar or semantically related causes and effects. Furthermore, gaining insight into how diseases are diagnosed and managed would enable medical practitioners to make more informed decisions on disease management. Illustrating the role of machine learning as an enabler of such informed decision support is a key aspect of this paper. We present an approach to identify causes and effects that exist across different diseases using a graph clustering based knowledge discovery approach.

Understanding causation and correlation between individual factors is key to decision making in multiple domains, including medicine and business. Traditionally, such associations between elements have been identified through human interpretation of text. However, this is a very time-consuming, manual, labour-intensive process and is limited by human capacity. The ability to mine textual content and to identify associations, and the nature of those associations, with machine learning techniques therefore provides significant opportunities. Specifically in the medical domain, where a significant amount of content is textual in nature (e.g. medical notes), the ability to identify causation and correlation between elements in large medical datasets offers significant opportunity for advancing medical research and enabling better decision making in condition diagnosis and patient management.

In this paper, we attempt to identify and create term clusters using a graph clustering approach and then perform topic classification. This approach improves the document classification task by putting semantically related terms in the same cluster. Users could then search for a specific disease and explore information pertaining to its causes and effects by means of semantically related texts.

===2. Clustering Based Knowledge Discovery===
To incorporate natural language understanding, common-sense and domain-specific knowledge can be used to improve the text representation: generating additional informative features enables a deeper understanding of the document text than the mere bag-of-words approach [4, 2, 8]. Mitra et al. [6] proposed an unsupervised feature selection algorithm based on measuring similarity between features, whereby redundant features are removed. Pedersen and Kulkarni [7] presented a system called SenseClusters that clusters similar contexts in natural language text and assigns identifying labels to these clusters based on their content; in addition to clustering similar contexts, it can be used to identify synonyms and sets of related words (here, "word" and "term" are used interchangeably). To incorporate this kind of additional common-sense/domain knowledge, Gabrilovich and Markovitch [3] used world knowledge from an open knowledge repository, Wikipedia, to generate additional features. We adopt a similar intuition to form word clusters that generate features enriching the document content; the documents are then represented in the knowledge-rich space of generated features, which leads to a better organization of semantically related text representations. In this scheme, given a knowledge repository, the text documents are examined and their representation is enriched in a completely mechanical way. Motivated by these considerations, our aim is to empower machine learning techniques for text representation with a substantially wider body of knowledge, closer to the one underlying the superior inference capabilities of humans.
====2.1 Mathematical Formulation====
In this section, we characterize the medical forum data formally. Each post pertaining to a specific topic is informally written, and we focus on terms and their co-occurrences. We have n textual descriptions, namely posts: P = {p1, p2, ..., pn}. Each post is represented as a sequence of terms illustrating the scenario of the underlying disease: pi = {t1, t2, ..., tm}, where 1 ≤ i ≤ n and m varies from post to post. We convert the entire text of all posts into a graph G = (V, E), where V is the set of nodes (each term occurring in P is a node) and an edge connects any pair of terms that co-occur across the posts, the edge weight representing the strength of association between the pair.

====2.2 Graph Clustering====
Clustering identifies patterns or structure in unlabeled data. In general, clustering organizes data into groups whose members are related in some way; two or more items are grouped into the same cluster if they fall close to each other's context. Clustering has many useful applications, such as finding groups of people with similar behaviour, processing orders, grouping plants and animals, and grouping web log data to discover similar access patterns.

While exploring the variety of possible contexts for causes and effects of diseases in text fragments, the feature space grows, and it is hardly possible to limit the expansion of the new feature space containing the local contexts extracted from informally written medical data. This can be addressed with dimensionality reduction techniques that limit the size of the documents to be classified. Our approach first builds word cluster vectors using unsupervised feature generation to identify related contexts; supervised learning is then performed over the identified contexts to categorize the given collection of user posts.

First, we take the entire collection of medical diagnosis related posts and extract the list of distinguishable unique terms. Using these terms, we build a weighted graph in which nodes represent terms and each edge weight is the number of documents in which the given pair of terms co-occurs across the collection of posts. For each unique term, the list of documents in which it occurs is retrieved; using this data, we build the weighted graph in which the edge between two terms implicitly represents their semantic association. This process is repeated for all features, yielding a weighted graph for the overall data, and the problem is thus modeled as a graph clustering problem. This results in a graph G = (V, E, A), where |V| is the number of unique terms, |E| is the number of edges, and A is the |V| × |V| adjacency matrix whose nonzero entries are the edge weights between pairs of terms (an adjacency list is used when the matrix is sparse; in that case the number of rows equals the total number of terms in the graph).

We use the kernel-based multilevel clustering algorithm proposed by Dhillon et al. [1] on the weighted input graph with the desired number of partitions. The algorithm proceeds in three steps: coarsening, base clustering, and refinement. In coarsening, the given graph is repeatedly transformed into smaller subgraphs until only a few nodes remain. During base clustering, a region-growing approach [5] can be used on these few nodes; the quality of the resulting clusters depends on the choice of the initial nodes. The refinement process is applied as follows: if a node in Gi is in cluster c, then all nodes in Gi−1 formed from that node are in cluster c. For details, please refer to [1, 5].

In this work, we generate term clusters from the extracted word graphs using co-occurrence information of terms. The task is to partition the graph into clusters so that terms are grouped into a few subsets and the dimensionality of the new term space is reduced. Based on the generated term clusters, we then perform classification to identify similar causes and effects across various diseases.

====2.3 Proposed Approach====
The proposed approach works as follows. From the posts of each topic, textual descriptions are extracted. Unique terms (after stop word removal) become nodes in the graph, and the number of times a pair of terms co-occurs in the entire corpus becomes the weight of the edge connecting them. First, we build word cluster vectors using the graph clustering algorithm.
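The co-occurrence graph construction described above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: the stop word list is a tiny stand-in for the SMART list, and counting one co-occurrence per post is an assumption.

```python
from collections import Counter
from itertools import combinations

STOP_WORDS = {"the", "a", "of", "and", "my"}  # stand-in for the SMART stop list

def build_cooccurrence_graph(posts):
    """Build the weighted term graph: nodes are unique (non-stop) terms,
    edge weights count the posts in which a pair of terms co-occurs."""
    weights = Counter()
    for post in posts:
        terms = {t for t in post.lower().split() if t not in STOP_WORDS}
        for u, v in combinations(sorted(terms), 2):
            weights[(u, v)] += 1
    # adjacency list: term -> {neighbour: co-occurrence weight}
    graph = {}
    for (u, v), w in weights.items():
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph

posts = [
    "wheezing and shortness of breath",
    "shortness of breath after exercise",
]
g = build_cooccurrence_graph(posts)
# "shortness" and "breath" co-occur in both posts, so their edge weight is 2
```

The adjacency-list form mirrors the sparse-matrix remark above: only observed term pairs are stored.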
'''Algorithm 1: Building Word Clusters'''

Input: a set of n textual descriptions (posts) P = {p1, p2, ..., pn} and a set of predefined category labels C = {c1, c2, ..., cl}.

Build word cluster vectors:
# Extract the text from the posts and build the unique word list.
# For each unique term ti in the word list:
## identify the edges from ti to every other term with nonzero positive weight;
## store each co-occurring term with its corresponding edge weight in the adjacency list.
# Run the kernel-based multilevel graph clustering algorithm on the adjacency list to generate cluster IDs.
# For every cluster ID, construct the word cluster.
# Store these word cluster vectors.

Secondly, we use these word cluster vectors to re-represent the text documents that discuss the causes and effects of a specific disease.
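Algorithm 1's clustering step uses the kernel-based multilevel algorithm of Dhillon et al. [1]. As a self-contained stand-in, the sketch below groups terms by connected components over sufficiently heavy edges and then re-represents a post by the cluster IDs of its terms; the weight threshold and the helper names are illustrative assumptions, not the paper's method.

```python
def word_clusters(graph, min_weight=2):
    """Stand-in for the clustering step: assign terms to clusters via
    connected components over edges with weight >= min_weight."""
    cluster_id = {}
    next_id = 0
    for start in graph:
        if start in cluster_id:
            continue
        cluster_id[start] = next_id
        stack = [start]
        while stack:
            node = stack.pop()
            for nbr, w in graph[node].items():
                if w >= min_weight and nbr not in cluster_id:
                    cluster_id[nbr] = next_id
                    stack.append(nbr)
        next_id += 1
    return cluster_id

def to_cluster_features(post, cluster_id):
    """Replace each known term of a post by its cluster ID so that
    semantically related terms collapse onto the same feature."""
    return ["C%d" % cluster_id[t] for t in post.lower().split() if t in cluster_id]

graph = {
    "asthma":   {"wheezing": 3, "cavity": 1},
    "wheezing": {"asthma": 3},
    "cavity":   {"asthma": 1, "tooth": 4},
    "tooth":    {"cavity": 4},
}
cid = word_clusters(graph)
feats = to_cluster_features("asthma with wheezing", cid)
# asthma/wheezing share one cluster, cavity/tooth another; the weak
# asthma-cavity edge (weight 1) does not merge the two clusters
```

The mapping step here is the same idea applied in the classification stage below: documents become sequences of cluster IDs rather than raw terms.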
Then we perform classification of the re-represented text documents. The proposed algorithm thus inherently clusters semantically related terms and classifies over them: based on the word clusters found in the first step, classification is performed in the clustered space of labels pertaining to the terms of the original documents.

'''Algorithm 2: Classification using Word Clusters'''

# Preprocess the text documents by removing numbers, punctuation and stop words (using the SMART word list).
# For each processed post pi in P:
## for each unique term in pi:
### identify its cluster ID;
### map the given feature to its cluster ID;
### augment the text fragments with the cluster ID mappings.
# Build a classifier on the mapped/expanded text data (containing only cluster IDs) and use it to predict the class of texts having similar causes and effects.
# Compute the classification accuracy.

Output: the category label(s) for posts that have similar causes and effects across diseases.

{| border="1"
|+ Table 1: Corpus statistics
! S.No !! Class !! #Diseases (Posts)
|-
| 1 || Asthma || 35 (43)
|-
| 2 || Breast-Cancer || 35 (37)
|-
| 3 || COPD || 41 (47)
|-
| 4 || Cosmetic || 57 (64)
|-
| 5 || Dental || 97 (100)
|-
| 6 || Embarazo || 41 (52)
|-
| 7 || Genetic-Disorder || 59 (68)
|-
| 8 || Hepatitis || 110 (147)
|-
| 9 || Kidney || 69 (89)
|-
| 10 || Liver-Transplant || 64 (62)
|-
| 11 || Oral || 35 (47)
|-
| 12 || Pathology || 31 (36)
|-
| 13 || Respiratory || 48 (54)
|-
| 14 || Thyroid-Cancer || 58 (63)
|-
| 15 || Varicose-Veins || 36 (52)
|}

{| border="1"
|+ Table 2: Two-way confusion matrix
! !! Predicted = Yes !! Predicted = No
|-
! Actual = Yes
| True Positive (TP) || False Negative (FN)
|-
! Actual = No
| False Positive (FP) || True Negative (TN)
|}
===3. Experimental Results===

====3.1 Corpus====
We crawled a subset of the MedHelp forum data (http://www.medhelp.org/forums/list/) from the world wide web. The MedHelp forum is organized as a set of topics, each representing a specific disease; under each topic there are several subtopics, and each subtopic consists of several posts from both users (patients or their dependents) and doctors. For our experiments, we selected 15 categories covering the most widely discussed topics in the MedHelp forum. The details of this experimental corpus are given in Table 1.

We used three different types of classifiers, namely Naive Bayes, k-Nearest Neighbours (k-NN), and Support Vector Machines (SVM), to test the effect of the proposed approach. We used rainbow (the 'Bow' toolkit, http://www.cs.cmu.edu/~mccallum/bow/) to build the classification models, with 60% of the data used for training and 40% for testing.

====3.1.1 Evaluation Methodology====
We used Precision, Recall, F-Measure and classification Accuracy to evaluate the quality of the identified diseases having similar causes and effects. We use the two-way confusion matrix given in Table 2 to derive the evaluation measures.

Precision is defined as follows:
: Precision(P) = TP / (TP + FP)
Recall is measured by the following equation:
: Recall(R) = TP / (TP + FN)
F-Measure is computed from the ratio between Precision and Recall:
: Fβ = (β² + 1)·P·R / (β²·P + R)
The balanced F1-measure, with β = 1 (P and R weighted equally), is given by:
: F1 = 2·P·R / (P + R)
Classification accuracy is measured from the confusion matrix as follows:
: Accuracy = (TP + TN) / (TP + FP + FN + TN)
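These evaluation measures follow directly from the four confusion-matrix counts; a minimal sketch mirroring the formulas above:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r, beta=1.0):
    # F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R); beta = 1 gives F1
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# illustrative counts, not taken from the paper
p, r = precision(tp=8, fp=2), recall(tp=8, fn=8)
```

With the illustrative counts, p = 0.8 and r = 0.5, and the balanced F1 weighs both equally.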
====3.1.2 Discussion====
We observed that the proposed approach performs well in classifying medical text containing causes and issues expressed by patients or by their friends or relatives. Figure 1 shows the classification accuracy of the proposed approach on the top 15 classes using the three types of classifiers. With the Naive Bayes method, the accuracy for the class "Asthma" drops because certain specific causes pertain to the related "Respiratory" class. Similar misclassification takes place across the "Dental" and "Oral" classes, and these misclassified instances share common causes. Even though diseases like "Breast-Cancer" and "Thyroid-Cancer" share a common cause in cancer, misclassification between them is significantly less. Misclassification additionally takes place across the diseases "Hepatitis" and "Liver-Transplant".

[Figure 1: Classification of top 15 classes: a comparison of Naive Bayes, k-NN and SVM methods.]

We examined the classification accuracy of the k-NN classifier for various values of k (10, 20, 30, 50) and found that the system performs best for k = 30. In this classification task, the maximum number of instances from the class "Liver-Transplant" is misclassified under the "Hepatitis" class, as the causes of these diseases coincide by and large. At the same time, none of the instances of the class "Pathology" is classified correctly, as its causes and effects are described very similarly to those of "Hepatitis" and "Thyroid-Cancer". With the SVM classifier, we noticed that some instances are misclassified across the classes "Breast-Cancer" and "Cosmetic"; in this case, users raised cross-reference queries concerning post-surgical treatment of breast cancer. Similar misclassification is found across the classes "Chronic Obstructive Pulmonary Disease" (COPD) and "Respiratory".

[Figure 2: Comparison of the overall classification accuracy, baseline vs. proposed approaches, for Naive Bayes, k-NN (k=30) and SVM; accuracy values shown include 54.13, 62.54, 70.86, 74.13, 78.86 and 82.54.]

Subsequently, we would like to apply this approach to the effective retrieval of causes and effects pertaining to a specific disease. We will also draw the information geometry of prominent diseases that share common causes/effects in subsequent experiments.
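As a rough, self-contained stand-in for the rainbow-based setup (the paper trains Naive Bayes, k-NN and SVM models on a 60/40 split), the sketch below trains a multinomial Naive Bayes with add-one smoothing on cluster-ID features; the toy data and helper names are illustrative assumptions, not the paper's corpus or toolkit.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    vocab = {t for d in docs for t in d}
    prior = Counter(labels)
    counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        counts[y].update(d)
    cond = {}
    for y in prior:
        total = sum(counts[y].values())
        cond[y] = {t: (counts[y][t] + 1) / (total + len(vocab)) for t in vocab}
    return prior, cond, vocab

def predict_nb(model, doc):
    prior, cond, vocab = model
    n = sum(prior.values())
    def log_posterior(y):
        s = math.log(prior[y] / n)
        for t in doc:
            if t in vocab:
                s += math.log(cond[y][t])
        return s
    return max(prior, key=log_posterior)

# toy posts already mapped to cluster-ID features (illustrative labels)
train_docs = [["C0", "C0"], ["C0", "C1"], ["C2", "C2"], ["C2", "C1"]]
train_labels = ["Asthma", "Asthma", "Dental", "Dental"]
model = train_nb(train_docs, train_labels)
label = predict_nb(model, ["C0", "C1"])
```

In the paper's pipeline, the held-out 40% of the mapped posts would be scored with `predict_nb` and the predictions tallied into the confusion matrix of Table 2.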
===4. Conclusion===
We proposed a method to enable greater understanding of various conditions, their symptoms, treatment and management by identifying similar scenarios, reasoning about effects to explore the information geometry, and learning the associated contextual factors from medical forum information extracted from health services data. The approach identifies semantically similar causes and effects for any specific disease or condition using implicit semantic interconnections among the medical terms. We use graph-based exploration of the terms and their relations (causes/effects) across the collection of posts and explore the information geometry pertaining to similar diseases. We evaluated the relevance of the contextual information retrieved for a specific disease and/or of similar factors across different diseases. The proposed approach looks promising in capturing similar scenarios pertaining to multiple diseases, which would enable medical practitioners to have a multi-faceted view of any specific disease, towards better decision making.

'''Acknowledgments:''' This research is co-funded by the Irish Government (Enterprise Ireland) and the European Union (European Regional Development Fund). Dr. Philip O'Reilly is the Principal Investigator responsible for this research and can be contacted at Philip.Oreilly@ucc.ie.

===5. References===
# Dhillon, I. S., Guan, Y., and Kulis, B. Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Trans. Pattern Anal. Mach. Intell. 29, 11 (2007), 1944–1957.
# Gabrilovich, E. Feature Generation for Textual Information Retrieval Using World Knowledge. PhD thesis, Technion – Israel Institute of Technology, Haifa, Israel, 2006.
# Gabrilovich, E., and Markovitch, S. Feature generation for text categorization using world knowledge. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI'05), Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005, pp. 1048–1053.
# Giles, J. Internet encyclopaedias go head to head. Nature 438, 1 (2005), 900–901.
# Karypis, G., and Kumar, V. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20, 1 (1998), 359–392.
# Mitra, P., Murthy, C. A., and Pal, S. K. Unsupervised feature selection using feature similarity. IEEE Trans. Pattern Anal. Mach. Intell. 24, 3 (2002), 301–312.
# Pedersen, T., and Kulkarni, A. Identifying similar words and contexts in natural language with SenseClusters. In Proc. of the 20th National Conf. on Artificial Intelligence (AAAI'05), AAAI Press, 2005, pp. 1694–1695.
# Prasath, R., and Sarkar, S. Unsupervised feature generation using knowledge repositories for effective text categorization. In Proceedings of ECAI 2010: 19th European Conference on Artificial Intelligence, IOS Press, Amsterdam, The Netherlands, 2010, pp. 1101–1102.