Distributional Semantic Representation in Health Care Text Classification

NLP_CEN_AMRITA@CHIS-FIRE-2016

Barathi Ganesh HB
Artificial Intelligence Practice
Tata Consultancy Services
Kochi - 682 042, India
barathiganesh.hb@tcs.com

Anand Kumar M and Soman KP
Center for Computational Engineering and Networking (CEN)
Amrita School of Engineering, Coimbatore
Amrita Vishwa Vidyapeetham, Amrita University, India
m_anandkumar@cb.amrita.edu, kp_soman@amrita.edu
ABSTRACT
This paper describes our proposed system for the Consumer Health Information Search (CHIS) task. The objective of Task 1 is to classify the sentences in a document as relevant or irrelevant with respect to a query, and the objective of Task 2 is to analyse the sentiment of the sentences in the documents with respect to the given query. In the proposed approach, a distributional representation of the text, together with statistical distance and correlation measures over it, is used to carry out the given tasks as text classification problems. In our experiment, Non-Negative Matrix Factorization is utilized to obtain the distributed representation of the documents as well as the queries, distance and correlation measures are taken as the features, and a Random Forest Tree classifier performs the classification. The proposed approach yields average accuracies of 70.19% in Task 1 and 34.64% in Task 2.
Keywords
Health Science; Distributional Semantics; Non-Negative Matrix Factorization; Term-Document Matrix; Text Classification
1.   INTRODUCTION
   Over the past few years, a tremendous amount of investment and research has been carried out to enhance predictive analytics through text analytics in the health care domain [11, 10]. Health care information is available as text (clinical trials) in the form of admission notes, literature, reports and summaries 1, 2. Unlike traditional structured text resources, the unstructured nature of clinical trial text sources introduces more challenges while mining information out of them. These challenges induce researchers to carry out text analytics research, both to enhance developed models and to create new ones.
   The information is explicitly available in Electronic Health Records (EHR) but only implicitly available in clinical trials, in the form of text. Our primary problem therefore becomes representing the text so that it can be easily and effectively used in a further application. The application may be a sequential modeling task (information extraction) or a text classification task (document retrieval, sentiment analysis on retrieved documents and validation of retrieved documents).
   Document retrieval is the primary task in text analytics applications; within it, the Consumer Health Information Search (CHIS) task focuses on validating the retrieved results (Relevant or Irrelevant) and performing sentiment analysis on the retrieved results (Support, Oppose and Neutral). The given problem can thus be viewed as a text classification problem with the target classes mentioned in the above two tasks.
   Text classification is a classic application in the text analytics domain, utilized in multiple domains and industries in various forms. Given a text content, the classifier must be capable of classifying it into a predefined set of classes [1]. The task becomes more complex when the text content includes medical descriptions (drug names, measurements and dosages), which introduce problems during representation as well as while mining information out of the text.
   The fundamental component of a classification task is text representation, which tries to represent the given text in an equivalent form of numerical components. These numerical components are then either utilized directly for classification or used to extract the features required to perform the classification task. Text representation methods have evolved over time to improve the fidelity of the representation, paving the way from frequency based representation methods to semantic representation methods. Though other methods are also available, this paper focuses only on the Vector Space Model (VSM) and the Vector Space Model of Semantics (VSMs) [13].
   In the VSM, a text is represented as a vector based on the occurrence of terms (binary matrix) or the frequency of occurrence of terms (Term-Document Matrix, TDM) present in the given text, using a vocabulary built across the entire corpus. Here, 'terms' denotes words or phrases [8]. Considering only the term frequency is not sufficient, since it ignores the syntactic and semantic information that lies within the text.
   The term-document matrix is also inefficient due to a biasing problem (i.e. a few terms get higher weight because of unbalanced and uninformative data). To overcome this, the Term Frequency - Inverse Document Frequency (TF-IDF) representation method was introduced, which re-weighs the term frequency based upon its presence across the documents [5]. It has a tendency to give higher weights to rarely occurring words, which may be misspelled or uninformative with respect to the classification task; this is common in clinical trial texts.
   The Vector Space Model of Semantics (VSMs) overcomes the above mentioned shortcomings by weighing terms based on their context. This is achieved by applying matrix factorization methods such as Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NMF) to the TDM [9, 15, 12]. It has the ability to weigh a term even though the term is not present in a given query, because matrix factorization represents the TDM through its basis vectors [3]. This representation does not include syntactic information, requires large data and is computationally expensive because of its high dimension.
   Word embeddings, along with the structure of the sentence, are utilized to represent short texts. They require much less data and the dimension of the vector can be controlled, but developing a Word to Vector (Word2Vec) model requires a very large corpus [14, 2]. We do not consider it here, since we do not have large clinical trial text data. Following the representation, similarity measures are computed between the query and the text documents to achieve the objective. Here the similarity measures are distance measures (cosine distance, Euclidean distance, Jaccard distance, etc.) and a correlation measure (Pearson correlation coefficient) [4].
   Considering the above pros and cons, the proposed approach is designed to observe the performance of distributional semantic representation of text in the classification task. The given queries and documents are represented as a TDM after the necessary preprocessing steps, and NMF is applied to it to obtain the distributional representation. Thereafter, distance and correlation measures between the query vector of each document and the vector representations of the sentences in the document are computed in order to perform the classification task.

1 https://medlineplus.gov/
2 https://clinicaltrials.gov/
2.   DISTRIBUTIONAL REPRESENTATION
   This section describes the distributional representation of the text, which is used further for the classification task. The distributional representation aims to compute basis vectors from the term frequency vectors by applying NMF to the TDM. The systematic approach for the distributional representation is given in Figure 1.

[Figure 1: Model Diagram for Distributional Representation of Text]

2.1   Problem Definition
   Let d_k = {s_1, s_2, s_3, ..., s_n} be the sentences in the k-th document of the document set D = {d_1, d_2, d_3, ..., d_n}, let q_i represent the i-th query, and let C = {c_1, c_2, ..., c_n} be the classes into which s falls with respect to q, where n is the size of the corpus. The objective of the experimentation is to classify each sentence in the document into its respective predefined class.
2.2   Preprocessing
   A few of the terms that appear across multiple classes create conflicts for the classification; such terms generally get low weights in a TF-IDF representation. Hence these terms are eliminated if they occur in more than 3/4 of the classes and, in order to avoid sparsity of the representation, terms with a document frequency of one are also eliminated. The TF-IDF representation itself is not used here, because it has a tendency to give weight to rare words, which are common in clinical texts (drug names, measurements and dosage levels). The advantage of the TF-IDF representation is instead obtained indirectly by handling the document frequency of the terms.
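As an illustration, the following is a minimal sketch of this filtering step, assuming scikit-learn; the corpus and thresholds are illustrative (not the CHIS data), and the document-level max_df stands in for the paper's per-class 3/4 rule.

# Hedged sketch of the preprocessing filters: drop terms occurring in more
# than 3/4 of the corpus and drop terms with a document frequency of one.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [  # toy sentences, not the CHIS data
    "vitamin c prevents the common cold",
    "vitamin c does not prevent the common cold",
    "the common cold resolves without vitamin c",
    "e-cigarettes help smokers quit smoking",
]

vectorizer = CountVectorizer(
    max_df=0.75,  # eliminate terms present in more than 3/4 of the sentences
    min_df=2,     # eliminate terms with a document frequency of one
)
counts = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())  # surviving vocabulary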
2.3   Vector Space Model : Term - Document Matrix
   In the TDM, the vocabulary is computed by finding the unique words present in the given corpus. Then the number of times a term is present (its term frequency) in each sentence is computed against the vocabulary formed. The terms present in this vocabulary act as the first level features.

A_{i,j} = TDM(Corpus)        (1)
A_i = termfrequency(q_i)     (2)

where i represents the i-th sentence and j represents the j-th term in the vocabulary. In order to improve the representation, along with the unigram words, bi-gram and tri-gram phrases are also considered, after following the above mentioned preprocessing steps.
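A small sketch of this construction, assuming scikit-learn (the corpus shown is illustrative):

# Hedged sketch of equations (1)-(2): rows of A are sentences, columns are
# vocabulary terms, and entries are term frequencies. ngram_range=(1, 3)
# adds the bi-gram and tri-gram phrase features mentioned above.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [  # illustrative sentences
    "skin care products block vitamin d",
    "sun exposure produces vitamin d in the skin",
    "vitamin d supplements can replace sun exposure",
]

tdm = CountVectorizer(ngram_range=(1, 3))  # unigrams, bi-grams, tri-grams
A = tdm.fit_transform(corpus)              # A[i, j] = frequency of term j in sentence i
print(A.shape)                             # (number of sentences, vocabulary size)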
2.4   Vector Space Model of Semantics : Distributional Representation
   The TDM computed above is factorized with NMF to obtain the distributional representation of the given corpus.

W_{i,r} = nmf(A_{i,j})        (3)
   In general, matrix factorization produces a product of matrices subject to the constraint that the reconstruction error be low. The product components from the factorization give the characteristics of the original matrix [9, 15]. Here NMF is incorporated into the proposed model to obtain the principal characteristics of the matrix, known as basis vectors. Sentences may vary in length, but their representation needs to be of fixed size for use in various applications; the TDM representation followed by Non-Negative Matrix Factorization (NMF) achieves this [16]. Mathematically it can be represented as

A ≈ W H^T        (4)

   If A is the m × n original TDM, then W is an m × r basis matrix and H is an n × r mixture matrix. A linear combination of the basis vectors (column vectors) of W, with the weights of H, gives the approximated original matrix A. While factorizing, random values are initially assigned to W and H, and then the optimization function is applied to compute appropriate W and H:

min_{W,H} f_r(W, H) ≡ ||A − W H^T||_F^2        (5)
s.t. W, H ≥ 0

   Here F denotes the Frobenius norm and r is the dimension-reduction parameter, which is set to 10 so as to have a fixed-size (m × 10) vector for each sentence. NMF is used for finding the basis vectors for the following reasons: the non-negativity constraints make interpretability more straightforward than with other factorization methods; the selection of r is straightforward; and the basis vectors in the semantic space are not constrained to be orthogonal, which cannot be obtained by finding singular vectors or eigenvectors [6].
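As a hedged sketch of equations (3)-(5), assuming scikit-learn's NMF; the random matrix below stands in for a real TDM.

# Factorize the TDM A into non-negative W (basis) and H (mixture),
# minimizing the Frobenius reconstruction error of equation (5).
# Note: scikit-learn's components_ corresponds to H^T in the paper's
# notation, so A is approximated by W @ H.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A = rng.integers(0, 4, size=(20, 100)).astype(float)  # stand-in TDM: 20 sentences x 100 terms

r = 10  # dimension-reduction parameter used in the paper
model = NMF(n_components=r, init="random", random_state=0, max_iter=500)
W = model.fit_transform(A)   # 20 x r fixed-size sentence vectors (equation (3))
H = model.components_        # r x 100 basis over the vocabulary

error = np.linalg.norm(A - W @ H, ord="fro") ** 2  # equation (5) objective
print(W.shape, round(error, 2))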
3.   TEXT CLASSIFICATION
   For this experiment the data set was provided by the Consumer Health Information Search (CHIS) task committee [7]. Detailed statistics of the training and testing sets are given in Table 1.
   Task 1: This task is a necessary unit for filtering the results retrieved by an Information Retrieval (IR) application, and it ensures the recall of the search engine, which is mandatory in health care text analytics applications. With this information, the remainder of this section describes the proposed approach to text classification in Task 1.
   Let d_k = {s_1, s_2, s_3, ..., s_n} be the sentences in the k-th document of the document set mentioned in Table 1 (D = {skincare, MMr, HRT, Ecig, Vitc}), let q_i represent the i-th query, and let C = {Relevant, Irrelevant} be the classes into which s falls with respect to q. n is the size of the corpus, which is also given in Table 1.
   The objective of the task is to classify each given sentence into its corresponding class (Relevant or Irrelevant). The distributional representations of the given training and testing corpora are computed as described in the previous section. A systematic diagram of the remaining approach is given in Figure 2.

[Figure 2: Model Diagram of Proposed Approach]

   After the representation, the similarity measures between the query vector q_i and the sentence vectors in D are computed; the computed similarity measures are given in Table 3. These similarity measures are taken as the attributes for the supervised classification algorithm, which is a Random Forest Tree (RFT).
   With the typical choice of √f features considered at each split, output labels Y = {y_1, y_2, y_3, ..., y_n} (Relevant, Irrelevant) and feature set F = {f_1, f_2, f_3, ..., f_n}, bagging is done repeatedly (B times, the number of trees) by selecting random samples and attributes from the training set and building a decision tree for each set. The predictions for the test set are then found by averaging the predictions from all the individual decision trees built on the training set. This can be interpreted as follows:

f_b = f(W_b, Y_b, F_b)        (6)

Y = (1/B) Σ_{b=1}^{B} f_b(Ŵ, F̂)        (7)

   In order to assess the performance, 10-fold cross validation was performed during training; this yields nearly 72% precision, and 68.12% against the test set.
   Task 2: This task is also a necessary unit, in order to interpret further information from the retrieved results. It is similar to Task 1 and is carried out exactly as Task 1, with the target class labels C = {Oppose, Support, Neutral}; the classes in C are the final output labels into which s falls with respect to q.
   Here also 10-fold cross validation was performed during training; this yields nearly 45% precision, and 38.53% against the test set. A detailed description of the results is given in Table 2.
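A minimal sketch of this classification step, assuming scikit-learn; the feature values below are random stand-ins for the Table 3 measures, not the CHIS data.

# Random Forest over the similarity features, with 10-fold cross validation
# as used for the reported precision estimates. X mimics the five Table 3
# measures per sentence; y mimics Relevant/Irrelevant labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.random((200, 5))          # stand-in similarity features (Table 3)
y = rng.integers(0, 2, size=200)  # stand-in Relevant / Irrelevant labels

# B bagged trees, each grown on a bootstrap sample with a random subset of
# features per split (equations (6)-(7)); sqrt(f) features is the default.
rft = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(rft, X, y, cv=10)  # 10-fold cross validation
print(scores.mean())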
 Document    # Training      # Task 1 Classes           # Task 2 Classes           # Testing
   Type      Sentences     Relevant   Irrelevant    Oppose   Support   Neutral     Sentences
 skincare        65           34          31          34        16        15           90
   MMr           70           49          21          34        33         3           60
   HRT           60           45          15          41        15         4           74
   Ecig          82           71          11          33        27        22           66
   Vitc          64           38          26          32        21        11           74

Table 1: Data-set Statistics
                                 Document     Task 1 Results in %        Task 2 Results in %
                                   Types      Max     Min    Ours        Max     Min    Ours
                                  skincare    79.55  48.86   48.86       73.8   23.86   23.86
                                    MMr       89.66  56.89   88.89       68.97  32.75   34.72
                                    HRT       93.06  38.89   75.86       54.16  22.2    43.10
                                    Ecig      76.56  46.88   76.56       67.19  29.69   39.06
                                    Vitc      78.38  55.41   60.81       50.00  31.08   32.43
                                  Average     78.10  54.84   70.19       55.43  33.64   34.64

                                               Table 2: Results Statistics

Measured Feature Functions

  Similarity (dot product):      P^T Q
  Euclidean distance:            sqrt( Σ_{i=1}^{d} |P_i − Q_i|^2 )
  Bray-Curtis dissimilarity:     Σ_{i=1}^{d} |P_i − Q_i| / Σ_{i=1}^{d} (P_i + Q_i)
  Chebyshev distance:            max_i |P_i − Q_i|
  Correlation:                   Σ_{i=1}^{d} (P_i − Q_i)^2 / Q_i

Table 3: Measured Similarity Features
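A sketch of these feature functions for a sentence vector P and a query vector Q, assuming SciPy/NumPy; the "correlation" entry reproduces the formula as printed in Table 3 rather than SciPy's Pearson-based correlation distance.

# Compute the Table 3 measures between a sentence vector P and a query
# vector Q. SciPy provides the three distances directly; the dot product
# and the printed chi-square-style "correlation" are one-liners in NumPy.
import numpy as np
from scipy.spatial.distance import braycurtis, chebyshev, euclidean

P = np.array([0.2, 0.5, 0.1, 0.9])  # illustrative sentence vector
Q = np.array([0.3, 0.4, 0.2, 0.8])  # illustrative query vector

features = {
    "dot_product": float(P @ Q),                     # P^T Q
    "euclidean": euclidean(P, Q),                    # sqrt(Σ |Pi − Qi|^2)
    "bray_curtis": braycurtis(P, Q),                 # Σ|Pi − Qi| / Σ(Pi + Qi)
    "chebyshev": chebyshev(P, Q),                    # max_i |Pi − Qi|
    "correlation": float(np.sum((P - Q) ** 2 / Q)),  # as printed in Table 3
}
print(features)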
4.   CONCLUSION
   The objective of the Consumer Health Information Search tasks is addressed as a text classification problem based on the distributional representation of the text, utilizing the term-document matrix and non-negative matrix factorization. Even though the proposed approach does not yield state-of-the-art performance in the tasks, the obtained results are good enough to continue the research; these results are described in Table 2. Distributional semantic representation methods suffer from the well known 'curse of dimensionality' problem. Hence future work will focus on reducing the dimensionality of the representation basis vectors and on including dedicated feature engineering for the health care domain.

5.   REFERENCES
[1] C. C. Aggarwal and C. Zhai. A survey of text classification algorithms. In Mining Text Data. Springer, 2012.
[2] H. B. Barathi Ganesh, M. Anand Kumar, and K. P. Soman. Amrita_CEN at SemEval-2016 Task 1: Semantic relation from word embeddings in higher dimension. In Proceedings of SemEval-2016, 2016.
[3] W. Blacoe and M. Lapata. A comparison of vector-based representations for semantic composition. In Proceedings of EMNLP-CoNLL, 2012.
[4] S.-H. Cha. Comprehensive survey on distance/similarity measures between probability density functions. 2007.
[5] J. Ramos. Using TF-IDF to determine word relevance in document queries. 2003.
[6] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791, 1999.
[7] S. Manjira, M. Sandya, and R. Shourya. CHIS@FIRE: Overview of the CHIS track on consumer health information search. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[8] A. Manwar, H. Mahalle, K. Chinchkhede, and V. Chavan. A vector space model for information retrieval: A MATLAB approach. Indian Journal of Computer Science and Engineering, 3:222-229, 2012.
[9] R. Pat. An introduction to latent semantic analysis. Indian Journal of Computer Science and Engineering.
[10] F. Popowich. Using text mining and natural language processing for health care claims processing. 2005.
[11] W. Raghupathi and V. Raghupathi. Big data analytics in healthcare: promise and potential. 2014.
[12] U. Reshma, H. B. Barathi Ganesh, and M. Anand Kumar. Author identification based on word distribution in word space. 2015.
[13] G. Salton, A. Wong, and C.-S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18:613-620, 1975.
[14] R. Socher, E. Huang, J. Pennington, C. Manning, and A. Ng. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, pages 801-809, 2011.
[15] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proceedings of SIGIR, pages 267-273, 2003.
[16] Y. Ye. Comparing matrix methods in text-based information retrieval. 2000.