<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NLP CEN AMRITA @ SMM4H: Health Care Text Classification through Class Embeddings</article-title>
      </title-group>
      <contrib-group>
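        <contrib contrib-type="author">
          <string-name>Barathi Ganesh Hullathy Balakrishnan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vinayakumar</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anand Kumar Madasamy</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soman Kotti Padannayil</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Amrita University</institution>
          ,
          <country country="IN">India</country>
        </aff>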
      </contrib-group>
      <abstract>
        <p>Artificial Intelligence has driven major advances in many domains and is now beginning to automate parts of the health care domain through Natural Language Processing and Computer Vision applications. As part of this trend, researchers are increasingly mining health-related information from text shared through social media and clinical trials. This paper describes our system for the health care text classification tasks organized by the Health Language Processing (HLP) Lab. We experimented with representing the target classes available in task 1 and task 2 as vectors, and performed the classification with a Support Vector Machine. To compute the representations of the target classes, we used traditional methods from Vector Space Models and Vector Space Models of Semantics. In this shared task, task 1 is to distinguish tweets that mention an "adverse drug reaction" from those that do not. Task 2 is to distinguish tweets that report personal medication intake, possible medication intake and non-intake. The preliminary results are encouraging enough to continue the research on a representation method for target classes.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Representation</title>
      <p>The objective here is to convert the given tweets into equivalent numerical representations in order to carry out
classification.</p>
    </sec>
    <sec id="sec-3">
      <title>Representation : Vector Space Models</title>
      <p>Document-Term Matrix (DTM) and Term Frequency-Inverse Document Frequency (TF-IDF) representation
methods are used, in which the given tweets <inline-formula><tex-math>T = \{t_1, t_2, t_3, \ldots, t_m\}</tex-math></inline-formula> are represented as a matrix D with dimension <inline-formula><tex-math>m \times n</tex-math></inline-formula>.
Here m is the number of tweets and n is the number of unique words present in the tweet collection
T.</p>
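      <p><disp-formula><tex-math>D = \mathrm{dtm}(T)</tex-math></disp-formula></p>
      <p><disp-formula><tex-math>D = \mathrm{tf\_idf}(T)</tex-math></disp-formula></p>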
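      <p>The two representations can be illustrated with a minimal sketch. The tokenizer (lower-casing and whitespace splitting), the weighting tf * log(m / df) and the three toy tweets are assumptions made only for illustration; the paper does not state which variants were used.</p>
      <preformat>
```python
import math
from collections import Counter


def dtm(tweets):
    """Return (vocabulary, D) where D[i][j] counts word j in tweet i."""
    tokenized = [t.lower().split() for t in tweets]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: j for j, w in enumerate(vocab)}
    D = [[0] * len(vocab) for _ in tokenized]
    for i, toks in enumerate(tokenized):
        for w, count in Counter(toks).items():
            D[i][index[w]] = count
    return vocab, D


def tfidf(D):
    """Re-weight a count matrix with tf * log(m / df), one common variant."""
    m = len(D)
    n = len(D[0])
    # df[j] is the number of tweets that contain word j (never zero here,
    # because the vocabulary is built from the tweets themselves).
    df = [sum(1 for row in D if row[j] != 0) for j in range(n)]
    return [[row[j] * math.log(m / df[j]) for j in range(n)] for row in D]


# Made-up toy tweets, not taken from the shared task data.
tweets = [
    "this drug gave me a headache",
    "took the drug this morning",
    "no headache this morning",
]
vocab, D = dtm(tweets)
W = tfidf(D)
print(len(D), len(vocab))
```
      </preformat>
      <p>In this sketch a word that occurs in every tweet receives a TF-IDF weight of zero, while a word that occurs in a single tweet receives the largest weight, which mirrors the re-weighting behaviour of TF-IDF.</p>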
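      <p><disp-formula><tex-math>U \Sigma V^{T} = \mathrm{svd}(D)</tex-math></disp-formula></p>
      <p>In the above equation, U is the distributional representation of the tweets with dimension <inline-formula><tex-math>m \times m</tex-math></inline-formula>, <inline-formula><tex-math>V^{T}</tex-math></inline-formula>
is the distributional representation of the words with dimension <inline-formula><tex-math>n \times n</tex-math></inline-formula>, and <inline-formula><tex-math>\Sigma</tex-math></inline-formula> holds the significance
of the basis vectors present in U and <inline-formula><tex-math>V^{T}</tex-math></inline-formula>. In detail, the column vectors of U are the eigenvectors of <inline-formula><tex-math>DD^{T}</tex-math></inline-formula> and span
the column space, the column vectors of V (the rows of <inline-formula><tex-math>V^{T}</tex-math></inline-formula>) are the eigenvectors of <inline-formula><tex-math>D^{T}D</tex-math></inline-formula> and span the row space, and the
diagonal elements of <inline-formula><tex-math>\Sigma</tex-math></inline-formula> are the square roots of the non-zero eigenvalues of <inline-formula><tex-math>DD^{T}</tex-math></inline-formula> and <inline-formula><tex-math>D^{T}D</tex-math></inline-formula>. The product <inline-formula><tex-math>D^{T}D</tex-math></inline-formula> captures the
co-occurrence of the words in the matrix D. Finally, the resulting column vectors of U are taken as D for the further
steps.</p>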
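      <p>The relations between the SVD factors and the eigenvalues of <inline-formula><tex-math>DD^{T}</tex-math></inline-formula> can be checked numerically. The following is a minimal sketch that assumes NumPy is available; the small matrix is made up and only stands in for D.</p>
      <preformat>
```python
import numpy as np

# Toy 3 x 4 matrix standing in for D (3 tweets, 4 words); values are made up.
D = np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 3.0],
])

# Full SVD: U is m x m, Vt is n x n, s holds the singular values.
U, s, Vt = np.linalg.svd(D, full_matrices=True)

# Eigenvalues of D D^T, sorted in descending order to match s.
eigvals = np.sort(np.linalg.eigvalsh(D @ D.T))[::-1]

# The squared singular values equal those eigenvalues.
print(np.allclose(s ** 2, eigvals))

# Rebuild D from the three factors to confirm the decomposition.
Sigma = np.zeros(D.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
print(np.allclose(U @ Sigma @ Vt, D))

# U supplies one m-dimensional vector per tweet for the later steps.
D_svd = U
```
      </preformat>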
    </sec>
    <sec id="sec-4">
      <title>Representation : Class Embedding</title>
      <p>We experimented with representing each target class as an entropy vector, obtained by summing the tweet vectors
in the matrix D that belong to that target class. Mathematically, the embedding of a class is the sum of the rows of D whose tweets carry that class label.</p>
      <p>In DTM, only the frequency counts of the words are used to form the representation of the tweets<sup><xref ref-type="bibr" rid="ref3">3</xref></sup>. In TF-IDF, along
with the frequency count of a word, the number of tweets in which the word appears (inverse document
frequency) is also taken into consideration<sup><xref ref-type="bibr" rid="ref4">4</xref></sup>. This re-weighting scheme in TF-IDF gives higher weights to
rarely occurring words and lower weights to frequently occurring words.</p>
    </sec>
    <sec id="sec-5">
      <title>Representation : Vector Space Models of Semantics</title>
      <p>The matrix computed in the previous section undergoes matrix factorization to obtain the distributional representation
of the tweets. These vectors can be seen as a semantic representation of the tweets, because the vectors produced by the matrix
factorization form the basis vector representation of the matrix D. Here Singular Value Decomposition (SVD) is used
to perform the matrix factorization<sup><xref ref-type="bibr" rid="ref5">5</xref>,<xref ref-type="bibr" rid="ref6">6</xref></sup>.</p>
      <p><disp-formula><tex-math>C_e = \sum_{i=1}^{m} D[i,:] \quad \text{if } t_c = C</tex-math></disp-formula></p>
      <p>In the above equation, <inline-formula><tex-math>C_e</tex-math></inline-formula> represents the class embedding (the entropy vector of the class), <inline-formula><tex-math>t_c</tex-math></inline-formula> represents the target class of each
tweet and C represents one of the available unique target classes. The dimension of the class embedding is <inline-formula><tex-math>1 \times n</tex-math></inline-formula> in the Vector Space
Model representation and <inline-formula><tex-math>1 \times m</tex-math></inline-formula> in the Vector Space Models of Semantics representation.</p>
      <p><disp-formula><tex-math>F = \mathrm{features}(D, C_e)</tex-math></disp-formula></p>
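      <p>The class embedding sum can be sketched in a few lines. The helper name class_embeddings, the toy matrix and the label strings are hypothetical and serve only to illustrate the summation of tweet vectors per target class.</p>
      <preformat>
```python
def class_embeddings(D, labels):
    """Sum the rows of D per class label and return {label: vector}."""
    n = len(D[0])
    embeddings = {}
    for row, label in zip(D, labels):
        vec = embeddings.setdefault(label, [0.0] * n)
        for j, value in enumerate(row):
            vec[j] += value
    return embeddings


# Toy data (made up): 4 tweets, 3 features, two target classes.
D = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
    [3.0, 0.0, 0.0],
    [0.0, 2.0, 1.0],
]
labels = ["ADR", "no ADR", "ADR", "no ADR"]
Ce = class_embeddings(D, labels)
print(Ce)
```
      </preformat>
      <p>Because each embedding is a plain sum, its magnitude grows with the number of tweets in the class, so a class with many instances yields a larger vector than a class with few.</p>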
    </sec>
    <sec id="sec-6">
      <title>Representation : Feature Learning</title>
      <p>The distance, similarity and correlation between the class embeddings and the tweet vectors are measured to obtain the feature
matrix used for the final prediction.</p>
      <p>Here F is the feature matrix with dimension <inline-formula><tex-math>m \times (5 \times \text{number of unique target classes})</tex-math></inline-formula>. The measured features
are Dot Product, Euclidean Distance, Chebyshev Distance, Bray-Curtis Dissimilarity and Correlation<sup><xref ref-type="bibr" rid="ref7">7</xref></sup>.</p>
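      <p>A minimal sketch of how such a feature matrix could be assembled is given below. The function names and toy vectors are hypothetical, and "Correlation" is assumed here to mean the Pearson correlation coefficient, which the paper does not specify.</p>
      <preformat>
```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def chebyshev(u, v):
    return max(abs(a - b) for a, b in zip(u, v))


def bray_curtis(u, v):
    # Assumes the denominator is non-zero (the vectors are not both all zeros).
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(abs(a + b) for a, b in zip(u, v))
    return num / den


def correlation(u, v):
    # Pearson correlation (an assumption); requires non-constant vectors.
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    du = [a - mu for a in u]
    dv = [b - mv for b in v]
    return dot(du, dv) / math.sqrt(dot(du, du) * dot(dv, dv))


MEASURES = (dot, euclidean, chebyshev, bray_curtis, correlation)


def features(D, class_vectors):
    """Return F with one row per tweet and 5 columns per class embedding."""
    return [[f(row, ce) for ce in class_vectors for f in MEASURES] for row in D]


# Toy data (made up): 2 tweets and 2 class embeddings over 3 features.
D = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
class_vectors = [[4.0, 0.0, 2.0], [0.0, 3.0, 2.0]]
F = features(D, class_vectors)
print(len(F), len(F[0]))
```
      </preformat>
      <p>With two target classes each tweet receives 10 features, and with three target classes it receives 15, matching the stated dimension of F.</p>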
    </sec>
    <sec id="sec-7">
      <title>Experiments</title>
      <p>This section details how the proposed approach is applied to the task 1 and task 2 data sets. Task 1 is a binary
classification problem<sup><xref ref-type="bibr" rid="ref8">8</xref></sup> and task 2 is a multi-class classification problem<sup><xref ref-type="bibr" rid="ref9">9</xref></sup>. The data sets for both tasks were provided
by the shared task organizers and their statistics are given in Table 1 and Table 2. Each task's data set includes training data,
development data and test data.</p>
    </sec>
    <sec id="sec-8">
      <title>Data</title>
      <p>Table 1 and Table 2 report the data set statistics for the Train, Dev and Test splits.</p>
      <p>The tweets in the given data sets are represented as a matrix using the methods described in the Vector Space Models and Vector Space Models of Semantics sections.
The target classes available for each task are listed in Table 1 and Table 2. The submitted runs vary only in the
representation; the subsequent classification is the same for all runs. In task 1, the given data is represented as
DTM in run 1, TF-IDF in run 2, DTM followed by SVD in run 3 and TF-IDF followed by SVD in run 4. When
performing SVD, we took the column vectors of U as the basis vector representation of the tweets, so the dimension
of these vectors is equal to the number of instances. As in task 1, task 2 is also computed with the four types of
representations.</p>
      <p>The class embeddings for the target classes are computed by summing the tweet vectors that belong to the respective
classes. In this way, for task 1 we computed two class embeddings (ADR mentioned and ADR not mentioned).
For task 2, we computed three class embeddings (personal medication intake, possible medication intake and
non-intake).</p>
      <p>After the class embeddings are computed, the features between the tweet vectors and the class
embeddings are computed as described in the Feature Learning section. These measures are taken as the attributes and given to the classifier
to make the final prediction. In task 1, a one-class SVM is used to handle the label bias problem. In task 2, an SVM
with an RBF kernel is used to make the final prediction.</p>
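      <p>The two classifier set-ups can be sketched as follows, assuming scikit-learn and NumPy are available. The random feature matrices, the labels and the hyperparameters (nu, gamma) are placeholders chosen for illustration; the paper does not report the settings that were used.</p>
      <preformat>
```python
import numpy as np
from sklearn.svm import SVC, OneClassSVM

rng = np.random.default_rng(0)

# Task 1 style: fit a one-class SVM on feature rows of a single class only.
# 10 columns correspond to 5 measures x 2 class embeddings.
F_one_class = rng.normal(loc=0.0, scale=1.0, size=(40, 10))
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
detector.fit(F_one_class)
# predict returns +1 for rows that resemble the training class, -1 otherwise.
task1_pred = detector.predict(rng.normal(size=(5, 10)))

# Task 2 style: multi-class SVM with an RBF kernel on labelled feature rows.
# 15 columns correspond to 5 measures x 3 class embeddings.
F_multi = rng.normal(size=(60, 15))
y_multi = np.repeat([0, 1, 2], 20)
classifier = SVC(kernel="rbf", gamma="scale")
classifier.fit(F_multi, y_multi)
task2_pred = classifier.predict(F_multi[:5])

print(task1_pred.shape, task2_pred.shape)
```
      </preformat>
      <p>Training the one-class model on a single class sidesteps the need for balanced labels, which is the motivation the text gives for using it in task 1.</p>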
      <p>In task 1, it was observed that, except for TF-IDF, the representation methods show a higher error when
the one-class SVM is trained on the tweets with an ADR mention. Based on this, in the submitted runs the model is trained on the tweets in
which ADR is not mentioned. The observed training error rate for task 1 is given in Table 3.</p>
      <p>In task 2, applying SVD appears to produce an over-fitted model, giving constant accuracy under 10-fold
cross-validation. Hence, we did not submit multiple runs for task 2 and instead submitted the model based on DTM
and class embedding. The final submitted runs were evaluated by the shared task organizers and the obtained results
are given in Table 4 and Table 5.</p>
      <table-wrap id="tab-3">
        <label>Table 3</label>
        <caption>
          <p>Task 1 Training Error</p>
        </caption>
        <table>
          <thead>
            <tr><th>Run</th><th>Model</th><th>Training Error Against Same Class</th></tr>
          </thead>
          <tbody>
            <tr><td>1</td><td>ADR not mentioned</td><td>930</td></tr>
            <tr><td>2</td><td>ADR mentioned</td><td>930</td></tr>
            <tr><td>3</td><td>ADR not mentioned</td><td>195</td></tr>
            <tr><td>4</td><td>ADR not mentioned</td><td>856</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap id="tab-4">
        <label>Table 4</label>
        <caption>
          <p>Task 1 Results</p>
        </caption>
        <table>
          <thead>
            <tr><th>Run</th><th>ADR Precision</th><th>ADR Recall</th><th>ADR F-score</th></tr>
          </thead>
          <tbody>
            <tr><td>1</td><td>0.057</td><td>0.093</td><td>0.071</td></tr>
            <tr><td>2</td><td>0.056</td><td>0.109</td><td>0.074</td></tr>
            <tr><td>3</td><td>0.087</td><td>0.204</td><td>0.121</td></tr>
            <tr><td>4</td><td>0.186</td><td>0.481</td><td>0.268</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap id="tab-5">
        <label>Table 5</label>
        <caption>
          <p>Task 2 Results</p>
        </caption>
        <table>
          <thead>
            <tr><th>Micro-averaged precision for classes 1 and 2</th><th>Micro-averaged F-score for classes 1 and 2</th></tr>
          </thead>
          <tbody>
            <tr><td>0.569</td><td>0.462</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-9">
      <title>Conclusion</title>
      <p>The preliminary class representation approach reaches an ADR F-score of up to 0.268 in task 1 and a micro-averaged F-score of 0.462 in task 2. It has been
observed that the imbalance in the target classes is the core reason for the low scores. In particular, in the proposed class
representation the entropy of the target class vector depends directly on the number of instances that belong to
the respective class. Hence, future work will focus on handling the label bias problem, which is a common
scenario in many practical applications.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. <string-name>Barathi Ganesh HB</string-name>, <string-name>Anand Kumar M</string-name>, and <string-name>Soman KP</string-name>, <article-title>Distributional Semantic Representation in Health Care Text Classification</article-title>, <source>Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation</source>, Kolkata, India, December 7-10, <year>2016</year>, <fpage>201</fpage>-<lpage>204</lpage>. <ext-link ext-link-type="uri" xlink:href="http://ceur-ws.org/Vol-1737/T5-3.pdf">http://ceur-ws.org/Vol-1737/T5-3.pdf</ext-link>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. <string-name>Barathi Ganesh HB</string-name>, <string-name>Anand Kumar M</string-name>, and <string-name>Soman KP</string-name>, <article-title>Vector Space Model as Cognitive Space for Text Classification</article-title>, <source>arXiv</source>, <year>2017</year>, <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1708.06068">http://arxiv.org/abs/1708.06068</ext-link>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. <string-name><surname>Antonellis</surname>, <given-names>Ioannis</given-names></string-name>, and <string-name><given-names>Efstratios</given-names> <surname>Gallopoulos</surname></string-name>, <article-title>Exploring term-document matrices from matrix models in text mining</article-title>, <source>arXiv preprint</source> (<year>2006</year>).</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. <string-name><surname>Ramos</surname>, <given-names>Juan</given-names></string-name>, <article-title>Using tf-idf to determine word relevance in document queries</article-title>, <source>Proceedings of the first instructional conference on machine learning</source> (<year>2003</year>).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. <string-name><surname>Landauer</surname>, <given-names>Thomas K</given-names></string-name>, <article-title>Latent Semantic Analysis</article-title>, <source>Encyclopedia of Cognitive Science</source> (<year>2006</year>).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. <string-name>Barathi Ganesh HB</string-name>, <string-name>Anand Kumar M</string-name>, and <string-name>Soman KP</string-name>, <article-title>Statistical Semantics in Context Space</article-title>, <source>Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum</source>, Évora, Portugal, 5-8 September, <year>2016</year>, <fpage>881</fpage>-<lpage>889</lpage>, <ext-link ext-link-type="uri" xlink:href="http://ceur-ws.org/Vol-1609/16090881.pdf">http://ceur-ws.org/Vol-1609/16090881.pdf</ext-link>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. <string-name><surname>Cha</surname>, <given-names>Sung-Hyuk</given-names></string-name>, <article-title>Comprehensive survey on distance/similarity measures between probability density functions</article-title>, <source>City</source> (<year>2007</year>).</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. <string-name><surname>Sarker</surname>, <given-names>Abeed</given-names></string-name>, and <string-name><given-names>Graciela</given-names> <surname>Gonzalez</surname></string-name>, <article-title>Portable automatic text classification for adverse drug reaction detection via multi-corpus training</article-title>, <source>Journal of biomedical informatics</source> <volume>53</volume> (<year>2015</year>): <fpage>196</fpage>-<lpage>207</lpage>.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. <string-name><surname>Klein</surname>, <given-names>Ari</given-names></string-name>, <string-name><given-names>Abeed</given-names> <surname>Sarker</surname></string-name>, <string-name><given-names>Masoud</given-names> <surname>Rouhizadeh</surname></string-name>, <string-name><given-names>Karen</given-names> <surname>O'Connor</surname></string-name>, and <string-name><given-names>Graciela</given-names> <surname>Gonzalez</surname></string-name>, <article-title>Detecting Personal Medication Intake in Twitter: An Annotated Corpus and Baseline Classification System</article-title>, <source>BioNLP 2017</source> (<year>2017</year>): <fpage>136</fpage>-<lpage>142</lpage>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>