=Paper=
{{Paper
|id=Vol-1737/T5-3
|storemode=property
|title=Distributional Semantic Representation in Health Care Text Classification
|pdfUrl=https://ceur-ws.org/Vol-1737/T5-3.pdf
|volume=Vol-1737
|authors=Barathi Ganesh HB,Anand Kumar M,Soman K P
|dblpUrl=https://dblp.org/rec/conf/fire/HBMP16a
}}
==Distributional Semantic Representation in Health Care Text Classification==
Distributional Semantic Representation in Health Care Text Classification

NLP_CEN_AMRITA@CHIS-FIRE-2016

Barathi Ganesh HB
Artificial Intelligence Practice, Tata Consultancy Services, Kochi - 682 042, India
barathiganesh.hb@tcs.com

Anand Kumar M and Soman KP
Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Amrita University, India
m_anandkumar@cb.amrita.edu, kp_soman@amrita.edu

ABSTRACT

This paper describes our proposed system for the Consumer Health Information Search (CHIS) task. The objective of Task 1 is to classify the sentences in a document as relevant or irrelevant with respect to a query; Task 2 is to analyse the sentiment of the sentences in the documents with respect to the given query. In the proposed approach, a distributional representation of the text, together with statistical distance and correlation measures, is used to carry out both tasks as text classification problems. In our experiments, Non-Negative Matrix Factorization is used to obtain the distributed representation of the documents and the queries, distance and correlation measures are taken as features, and a Random Forest Tree classifier performs the classification. The proposed approach yields an average accuracy of 70.19% in Task 1 and 34.64% in Task 2.

Keywords

Health Science; Distributional Semantics; Non-Negative Matrix Factorization; Term-Document Matrix; Text Classification

1. INTRODUCTION

Over the past few years, a tremendous amount of investment and research has gone into enhancing predictive analytics through text analytics in the health care domain [11, 10]. Health care information is available as text (clinical trials) in the form of admission notes, literature, reports and summaries.¹ ² ³ Unlike traditional structured text resources, the unstructured nature of clinical trial text sources introduces additional challenges in mining information out of them. These challenges induce researchers to pursue text analytics research, both to enhance existing models and to create new ones.

¹ https://medlineplus.gov/
² https://clinicaltrials.gov/
³ https://clinicaltrials.gov/

This information is explicitly available in Electronic Health Records (EHR) but only implicitly available in clinical trials, in the form of text. Our primary problem therefore becomes representing the text so that it can be used easily and effectively by a downstream application. The application may be a sequential modelling task (information extraction) or a text classification task (document retrieval, sentiment analysis on retrieved documents, and validation of retrieved documents).

Document retrieval is the primary task in text analytics applications; the Consumer Health Information Search (CHIS) track focuses on validating the retrieved results (Relevant or Irrelevant) and on performing sentiment analysis on the retrieved results (Support, Oppose and Neutral). The given problem can therefore be viewed as a text classification problem with the target classes of the two tasks above.

Text classification is a classic application in the text analytics domain, used in many domains and industries in various forms. Given a text, the classifier must be capable of assigning it to one of a predefined set of classes [1]. The task becomes more complex when the text contains medical descriptions (drug names, measurements and dosages), which introduce problems both in the representation and in mining information out of it.

The fundamental component of a classification task is text representation, which maps the given text into an equivalent numerical form. These numerical components are either used directly for classification or used to extract the features required to perform the classification task. Text representation methods have evolved over time to improve the fidelity of the representation, moving from frequency-based methods to semantic representation methods. Though other methods are available, this paper focuses only on the Vector Space Model (VSM) and the Vector Space Model of Semantics (VSMs) [13].

In the VSM, a text is represented as a vector based on the occurrence of terms (a binary matrix) or the frequency of occurrence of terms (a Term-Document Matrix, TDM), with the vocabulary built across the entire corpus; here, 'terms' are words or phrases [8]. Considering only the term frequency is not sufficient, since it ignores the syntactic and semantic information that lies within the text. A minimal sketch of these two representations follows.
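For illustration only, the following sketch builds both VSM variants described above. The use of scikit-learn and the toy sentences are our own assumptions, not part of the submitted system.

```python
# A minimal sketch of the two VSM representations described above.
# scikit-learn and the toy sentences are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "vitamin c prevents the common cold",
    "there is no evidence vitamin c prevents colds",
]

# Binary occurrence matrix: 1 if a term appears in the sentence, else 0.
binary = CountVectorizer(binary=True).fit_transform(corpus)

# Term-Document Matrix: raw term frequencies against the corpus vocabulary.
vectorizer = CountVectorizer()
tdm = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the vocabulary (first-level features)
print(tdm.toarray())                       # rows: sentences, columns: terms
```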
The term-document matrix also suffers from a biasing problem: a few terms get higher weight because of unbalanced and uninformative data. To overcome this, the Term Frequency - Inverse Document Frequency (TF-IDF) representation was introduced, which re-weighs the term frequency based on its presence across the documents [5]. However, TF-IDF tends to give higher weights to rarely occurring words, and such words may be misspelled or uninformative with respect to the classification task, which is common in clinical trial texts.

The Vector Space Model of Semantics (VSMs) overcomes the above shortcomings by weighing terms based on their context. This is achieved by applying matrix factorization methods such as Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NMF) to the TDM [9, 15, 12]. This approach can weigh a term even when it is not present in a given query, because the factorization represents the TDM through its basis vectors [3]. The representation still does not capture syntactic information, requires large amounts of data, and is computationally expensive because of its high dimensionality.

Word embeddings, together with sentence structure, can be used to represent short texts. They require comparatively little data and allow the vector dimension to be controlled; however, building a Word to Vector (Word2Vec) model requires a very large corpus [14][2]. We do not consider word embeddings here, since we do not have a large clinical trial text corpus.

After the representation step, similarity measures between the query and the text documents are computed to achieve the objective. The similarity measures used here are distance measures (cosine distance, Euclidean distance, Jaccard distance, etc.) and a correlation measure (Pearson correlation coefficient) [4].

Considering the pros and cons above, the proposed approach evaluates the performance of a distributional semantic representation of text in the classification task. The given queries and documents are represented as a TDM after the necessary preprocessing steps, and NMF is applied to it to obtain the distributional representation. Thereafter, distance and correlation measures between the query vector of each document and the vector representations of the sentences in the document are computed in order to perform the classification task.

2. DISTRIBUTIONAL REPRESENTATION

This section describes the distributional representation of the text, which is then used for the classification task. The distributional representation computes basis vectors from the term frequency vectors by applying NMF to the TDM. The systematic approach for the distributional representation is given in Figure 1.

[Figure 1: Model Diagram for Distributional Representation of Text]

2.1 Problem Definition

Let d_k = s_1, s_2, s_3, ..., s_n be the sentences in the k-th document of the document set D = d_1, d_2, d_3, ..., d_n, let q_i represent the i-th query, and let C = c_1, c_2, ..., c_n be the classes into which a sentence s falls with respect to a query q, where n is the size of the corpus. The objective of the experiment is to classify each sentence in a document into its respective predefined class.

2.2 Preprocessing

Terms that appear across multiple classes conflict with the classification; such terms generally receive low weights in a TF-IDF representation. Hence these terms are eliminated if they occur in more than 3/4 of the classes and, to avoid sparsity in the representation, terms with a document frequency of one are also eliminated. The TF-IDF representation itself is not used here, because it tends to give weight to rare words, which are common in clinical texts (drug names, measurements and dosage levels). The advantage of TF-IDF is instead obtained indirectly by handling the document frequency of the terms, as sketched below.
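A rough sketch of how this frequency-based pruning might be realized. Note the approximation: the paper prunes terms by their spread across classes, whereas a vectorizer's max_df prunes by spread across documents; mapping "more than 3/4" to max_df=0.75, and all names below, are our interpretation.

```python
# Hedged sketch of the Section 2.2 preprocessing, assuming scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "hrt increases the risk of breast cancer",
    "hrt relieves menopausal symptoms",
    "the risk depends on dosage and duration",
]

vectorizer = CountVectorizer(
    min_df=2,           # drop document-frequency-one terms (avoids sparsity)
    max_df=0.75,        # drop near-ubiquitous terms (our reading of "3/4")
    ngram_range=(1, 3), # unigrams plus bi-/tri-gram phrases (see Section 2.3)
)
tdm = vectorizer.fit_transform(sentences)
```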
2.3 Vector Space Model: Term-Document Matrix

In the TDM, the vocabulary is computed by finding the unique words present in the given corpus. Then the number of times each term occurs (its term frequency) in each sentence is computed against the vocabulary. The terms present in this vocabulary act as first-level features.

    A_{i,j} = TDM(Corpus)            (1)
    A_i = termfrequency(q_i)         (2)

where i indexes the i-th sentence and j the j-th term in the vocabulary. To improve the representation, bi-gram and tri-gram phrases are considered along with the unigram words, after the preprocessing steps mentioned above.

2.4 Vector Space Model of Semantics: Distributional Representation

NMF is applied to the TDM computed above to obtain the distributional representation of the given corpus:

    W_{i,r} = nmf(A_{i,j})           (3)

In general, matrix factorization expresses a matrix as a product of matrices such that the reconstruction error is low; the product components capture the characteristics of the original matrix [9, 15]. Here, NMF is incorporated into the proposed model to obtain the principal characteristics of the matrix, known as basis vectors. Sentences vary in length, but their representations need to be of fixed size for use in various applications; the TDM representation followed by Non-Negative Matrix Factorization achieves this [16]. Mathematically,

    A ≈ W H^T                        (4)

If A is the m × n TDM, then W is the m × r basis matrix and H is the n × r mixture matrix. A linear combination of the basis (column) vectors of W with the weights of H approximates the original matrix A. During factorization, W and H are initialized with random values, and the optimization is then applied to compute appropriate W and H:

    min_{W,H} f_r(W, H) ≡ ||A − W H^T||_F²,  s.t. W, H ≥ 0    (5)

Here ||·||_F is the Frobenius norm and r is the dimension-reduction parameter, set to 10 so that each sentence has a fixed-size vector of length 10. NMF is used to find the basis vectors for the following reasons: the non-negativity constraints make interpretability more straightforward than with other factorization methods; the selection of r is straightforward; and the basis vectors in the semantic space are not constrained to be orthogonal, which cannot be achieved with singular vectors or eigenvectors [6].
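The factorization step of Equations (3)-(5) could look as follows. The library (scikit-learn) and the toy data are our assumptions; the paper sets r = 10, while the sketch uses r = 2 only so that the tiny corpus remains sensible.

```python
# Hedged sketch of Equations (3)-(5): NMF over the TDM yields fixed-size
# sentence vectors (the rows of W). scikit-learn is an assumed choice.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "e-cigarettes help smokers quit",
    "e-cigarettes are harmful to smokers",
    "quitting smoking reduces harm",
]

tdm = CountVectorizer().fit_transform(sentences)

# The paper sets r = 10; r = 2 here only because the toy corpus is tiny.
nmf = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = nmf.fit_transform(tdm)   # basis weights: one r-dimensional vector per sentence
H = nmf.components_          # mixture matrix; tdm is approximated by W @ H
print(W.shape)               # (3, 2): fixed-size representation per sentence
```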
3. TEXT CLASSIFICATION

The data set for this experiment was provided by the Consumer Health Information Search (CHIS) task committee [7]. Detailed statistics of the training and testing sets are given in Table 1.

Task 1: This task is a necessary unit for filtering the results retrieved by an Information Retrieval (IR) application, and it safeguards the recall of the search engine, which is mandatory for text analytics applications in the health care domain. With this in mind, the remainder of this section describes the proposed text classification approach for Task 1.

Let d_k = s_1, s_2, s_3, ..., s_n be the sentences of the k-th document in the document set of Table 1 (D = {skincare, MMr, HRT, Ecig, Vitc}), let q_i represent the i-th query, and let C = {Relevant, Irrelevant} be the classes into which a sentence s falls with respect to a query q; the corpus size n is also given in Table 1. The objective of the task is to classify each given sentence into its corresponding class (Relevant or Irrelevant).

The distributional representations of the given training and testing corpora are computed as described in the previous section. A systematic diagram of the remaining approach is given in Figure 2.

[Figure 2: Model Diagram of Proposed Approach]

After the representation step, the similarity measures between the query vector q_i and the sentence vectors in D are computed; the measures used are given in Table 3. These computed similarity measures are taken as the attributes for the supervised classification algorithm, a Random Forest Tree (RFT).

Given output labels Y = y_1, y_2, y_3, ..., y_n (Relevant, Irrelevant) and a feature set F = f_1, f_2, f_3, ..., f_n, bagging is performed repeatedly (B times, the number of trees) by selecting random samples and attributes from the training set (typically on the order of √f of the f attributes) and building a decision tree for each sample. Predictions for the test set are then found by averaging the predictions of all the individual decision trees built on the training set. This can be written as:

    f_b = f(W_b, Y_b, F_b)                   (6)
    Y = (1/B) Σ_{b=1}^{B} f_b(Ŵ, F̂)          (7)

To gauge performance, 10 × 10-fold cross-validation was performed during training, yielding a precision near 72%; on the test set the approach yields 68.12%.

Task 2: This task is also a necessary unit, for interpreting further information from the retrieved results. It is similar to Task 1 and is carried out in exactly the same way, with the target class labels C = {Oppose, Support, Neutral}; the classes in C are the final output labels into which a sentence s falls with respect to a query q.

Here also, 10 × 10-fold cross-validation was performed during training, yielding a precision near 45%; on the test set the approach yields 38.53%. A detailed description of the results is given in Table 2.
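To make the classification stage concrete, the following is a minimal sketch of the Task 1 pipeline: the Table 3 measures are computed between a query vector and each sentence vector, and a random forest is trained on them with 10-fold cross-validation. All data, variable names and hyperparameters below (including n_estimators) are illustrative assumptions, not the authors' settings.

```python
# Hedged sketch of the Task 1 stage: Table 3 similarity features feed a
# Random Forest, evaluated with 10-fold cross-validation. Toy data only.
import numpy as np
from scipy.spatial.distance import braycurtis, chebyshev, euclidean
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def similarity_features(q, s):
    """Measures between query vector q and sentence vector s (cf. Table 3)."""
    return [
        float(np.dot(q, s)),             # similarity (dot product)
        euclidean(q, s),                 # Euclidean distance
        braycurtis(q, s),                # Bray-Curtis dissimilarity
        chebyshev(q, s),                 # Chebyshev distance
        float(np.corrcoef(q, s)[0, 1]),  # correlation measure
    ]

rng = np.random.default_rng(0)
query = rng.random(10)                   # stand-in NMF query vector (r = 10)
sentence_vectors = rng.random((20, 10))  # stand-in NMF sentence vectors
labels = np.array([0, 1] * 10)           # stand-in Relevant / Irrelevant labels

X = np.array([similarity_features(query, s) for s in sentence_vectors])

rft = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(rft, X, labels, cv=10)  # 10-fold cross-validation
print(scores.mean())
```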
Table 1: Data-set Statistics

Document Type | Training Sentences | Task 1: Relevant | Task 1: Irrelevant | Task 2: Oppose | Task 2: Support | Task 2: Neutral | Testing Sentences
skincare      | 65                 | 34               | 31                 | 34             | 16              | 15              | 90
MMr           | 70                 | 49               | 21                 | 34             | 33              | 3               | 60
HRT           | 60                 | 45               | 15                 | 41             | 15              | 4               | 74
Ecig          | 82                 | 71               | 11                 | 33             | 27              | 22              | 66
Vitc          | 64                 | 38               | 26                 | 32             | 21              | 11              | 74

Table 2: Results Statistics (accuracy in %)

Document Type | Task 1 Max | Task 1 Min | Task 1 Ours | Task 2 Max | Task 2 Min | Task 2 Ours
skincare      | 79.55      | 48.86      | 48.86       | 73.80      | 23.86      | 23.86
MMr           | 89.66      | 56.89      | 88.89       | 68.97      | 32.75      | 34.72
HRT           | 93.06      | 38.89      | 75.86       | 54.16      | 22.20      | 43.10
Ecig          | 76.56      | 46.88      | 76.56       | 67.19      | 29.69      | 39.06
Vitc          | 78.38      | 55.41      | 60.81       | 50.00      | 31.08      | 32.43
Average       | 78.10      | 54.84      | 70.19       | 55.43      | 33.64      | 34.64

Table 3: Measured Similarity Features

Measured Feature          | Function
Similarity (dot product)  | P^T Q
Euclidean distance        | sqrt( Σ_{i=1}^{d} |P_i − Q_i|² )
Bray-Curtis dissimilarity | Σ_{i=1}^{d} |P_i − Q_i| / Σ_{i=1}^{d} (P_i + Q_i)
Chebyshev distance        | max_i |P_i − Q_i|
Correlation               | Σ_{i=1}^{d} (P_i − Q_i)² / Q_i

4. CONCLUSION

The objective of the Consumer Health Information Search tasks was addressed as a text classification problem based on a distributional representation of the text, using a term-document matrix and non-negative matrix factorization. Even though the proposed approach does not yield state-of-the-art performance on the tasks, the obtained results are good enough to continue the research; they are described in Table 2. Distributional semantic representation methods suffer from the well-known 'curse of dimensionality'. Future work will therefore focus on reducing the dimensionality of the representation basis vectors and on dedicated feature engineering for the health care domain.

5. REFERENCES

[1] C. C. Aggarwal and C. Zhai. A survey of text classification algorithms. In Mining Text Data. Springer, 2012.
[2] H. B. Barathi Ganesh, M. Anand Kumar, and K. P. Soman. Amrita_CEN at SemEval-2016 Task 1: Semantic relation from word embeddings in higher dimension. In Proceedings of SemEval-2016, 2016.
[3] W. Blacoe and M. Lapata. A comparison of vector-based representations for semantic composition. In Proceedings of EMNLP-CoNLL, 2012.
[4] S.-H. Cha. Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 2007.
[5] J. Ramos. Using TF-IDF to determine word relevance in document queries. 2003.
[6] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.
[7] S. Manjira, M. Sandya, and R. Shourya. CHIS@FIRE: Overview of the CHIS track on Consumer Health Information Search. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[8] A. Manwar, H. Mahalle, K. Chinchkhede, and V. Chavan. A vector space model for information retrieval: A MATLAB approach. Indian Journal of Computer Science and Engineering, 3:222–229, 2012.
[9] R. Pat. An introduction to latent semantic analysis.
[10] F. Popowich. Using text mining and natural language processing for health care claims processing. 2005.
[11] W. Raghupathi and V. Raghupathi. Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2, 2014.
[12] U. Reshma, H. B. Barathi Ganesh, and M. Anand Kumar. Author identification based on word distribution in word space. 2015.
[13] G. Salton, A. Wong, and C.-S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18:613–620, 1975.
[14] R. Socher, E. H. Huang, J. Pennington, C. D. Manning, and A. Y. Ng. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, pages 801–809, 2011.
[15] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proceedings of SIGIR, pages 267–273, 2003.
[16] Y. Ye. Comparing matrix methods in text-based information retrieval. 2000.