Distributional Semantic Representation for Text Classification and Information Retrieval
[NLP_CEN_AMRITA@MSIR-FIRE-2016]

Barathi Ganesh HB
Artificial Intelligence Practice
Tata Consultancy Services
Kochi - 682 042, India
barathiganesh.hb@tcs.com

Anand Kumar M and Soman KP
Center for Computational Engineering and Networking (CEN)
Amrita School of Engineering, Coimbatore
Amrita Vishwa Vidyapeetham, Amrita University, India
m_anandkumar@cb.amrita.edu, kp_soman@amrita.edu

ABSTRACT
The objective of this experiment is to validate the performance of the distributional semantic representation of text in a classification (question classification) task and an information retrieval task. Following the distributional representation, first-level classification of the questions is performed and the tweets relevant to the given queries are retrieved. The distributional representation of the text is obtained by performing Non-Negative Matrix Factorization on top of the Term - Document Matrix of the training and test corpus. To improve the semantic representation of the text, phrases are considered along with the words. The proposed approach achieved 80% F-1 measure and 0.0377 mean average precision on the Mixed Script Information Retrieval task 1 and task 2 test sets, respectively.

Keywords
Distributional Semantics; Non-Negative Matrix Factorization; Term - Document Matrix; Text Classification

1. INTRODUCTION
Information Retrieval (IR) and text classification are classic applications in the text analytics domain, utilized across multiple domains and industries in various forms. Given a text content, a classifier must be capable of assigning it to one of a predefined set of classes; given a query, a search engine must be capable of retrieving the relevant text content from a stored collection of text [1][12]. The task becomes more complex when the text contents are written in more than one language, which introduces problems both during representation and while mining information out of the text.

The fundamental component in classification and retrieval tasks is text representation, which expresses the given text in an equivalent form of numerical components. These numerical components are then either used directly to perform the required actions or used to extract the features needed for them.

Text representation methods have evolved over time to improve the quality of the representation, moving from frequency-based representation methods to semantic representation methods. Though other methods such as set-theoretic Boolean systems are also available, this paper focuses only on the Vector Space Model (VSM) and the Vector Space Model of Semantics (VSMs) [13].

In the VSM, the text is represented as a vector based on the occurrence of terms (a binary matrix) or the frequency of occurrence of terms (a Term - Document Matrix) in the given text. Here, 'terms' denotes words or phrases [9]. Considering only the term frequency is not sufficient, since it ignores the syntactic and semantic information that lies within the text.

The term - document matrix also suffers from a biasing problem (i.e. a few terms get higher weights because of unbalanced and uninformative data). To overcome this, the Term Frequency - Inverse Document Frequency (TF-IDF) representation was introduced, which re-weighs the terms based on their presence across the documents [7]. However, TF-IDF tends to give higher weights to rarely occurring words, and such words may simply be misspelled, which is common in social media texts.

The Vector Space Model of Semantics (VSMs) overcomes the above-mentioned shortcomings by weighing terms based on their context. This is achieved by applying matrix factorization methods such as Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NMF) to the TDM [10, 15, 11]. This gives the ability to weigh terms even when they are not present in a given query, because matrix factorization represents the TDM through its basis vectors [5]. However, this representation does not include syntactic information, requires large amounts of data and is computationally expensive because of its high dimension.

Word embeddings, combined with the structure of the sentence, are also used to represent short texts. They require much less data and the dimension of the vector can be controlled, but building a Word to Vector (Word2Vec) model requires a very large corpus [14][4]. We do not consider word embeddings here since we do not have a large amount of mixed-script text data. Following the representation, similarity measures are computed in order to perform the question classification task. The similarity measures used are distance measures (cosine distance, Euclidean distance, Jaccard distance, etc.) and a correlation measure (Pearson correlation coefficient) [6].
Considering the above pros and cons, the proposed approach is evaluated to observe the performance of the distributional semantic representation of text in the classification and retrieval tasks. The given questions are represented as a TDM after the necessary preprocessing steps, and NMF is applied on it to get the distributional representation. Thereafter, distance and correlation measures between the entropy vector of each class and the vector representation of a question are computed in order to perform the question classification task; to retrieve the relevant tweets with respect to a given query, the cosine distance between the query and the tweets is measured.

Figure 1: Model Diagram for Distributional Representation of Text

2. DISTRIBUTIONAL REPRESENTATION
This section describes the distributional representation of the text, which is used further for the question classification and retrieval tasks. The systematic approach for the distributional representation is given in Figure 1.

2.1 Problem Definition
Let Q = q_1, q_2, q_3, ..., q_n be the questions (q_i represents the i-th question), C = c_1, c_2, ..., c_n be the classes under which the questions fall, and n be the size of the corpus. T = t_1, t_2, ..., t_n are the tweets to which the questions are related, with n the size of that corpus. The objective of the experiment is to classify each query into its respective predefined class in task 1 and to retrieve the relevant tweets with respect to the input query in task 2.

2.2 Preprocessing
A few terms that appear across multiple classes create conflicts in classification; such terms generally get low weights in a TF-IDF representation. Hence these terms are eliminated if they occur in more than 3/4 of the classes and, in order to avoid sparsity of the representation, terms with a document frequency of one are also eliminated. The TF-IDF representation itself is not considered here, because it tends to give high weights to rare words, which are very common in mixed-script texts. The advantage of the TF-IDF representation is instead obtained indirectly by handling the document frequency of the terms.

2.3 Vector Space Model : Term - Document Matrix
In the TDM, the vocabulary is computed by finding the unique words present in the given corpus. Then the number of times each term is present (term frequency) in each question is computed against the vocabulary formed. The terms present in this vocabulary act as the first-level features.

    A_{i,j} = TDM(Corpus)                                          (1)

    A_i = term_frequency(q_i)                                      (2)

where i represents the i-th question and j represents the j-th term in the vocabulary. In order to improve the representation, along with the unigram words, bi-gram and tri-gram phrases are also considered after following the above-mentioned preprocessing steps.
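To make Equations (1)-(2) concrete, the following is a minimal sketch of building such a term - document matrix with unigram, bi-gram and tri-gram features, assuming scikit-learn's CountVectorizer is available; the toy questions and variable names are illustrative only, and min_df=2 stands in for the document-frequency filtering of Section 2.2.

    # Minimal sketch of Eq. (1)-(2), assuming scikit-learn is available.
    # The toy questions and all variable names are illustrative only.
    from sklearn.feature_extraction.text import CountVectorizer

    questions = [
        "aaj kya hai",
        "kaunsa din hai aaj",
        "what is the capital of india",
    ]

    # Unigrams, bi-grams and tri-grams; min_df=2 drops terms with a
    # document frequency of one, mirroring the preprocessing step above.
    vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=2)
    A = vectorizer.fit_transform(questions)   # A[i, j]: count of term j in question i

    print(A.shape)                            # (number of questions, vocabulary size)
    print(sorted(vectorizer.vocabulary_))     # surviving terms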
2.4 Vector Space Model of Semantics : Distributional Representation
The TDM computed above is given to NMF to obtain the distributional representation of the given corpus.

    W_{i,r} = nmf(A_{i,j})                                         (3)

In general, matrix factorization expresses a matrix as a product of matrices, subject to the constraint that the reconstruction error is low. The product components of the factorization give the characteristics of the original matrix [10, 15]. Here NMF is incorporated into the proposed model to get the principal characteristics of the matrix, known as basis vectors. Sentences may vary in length, but their representations need to be of a fixed size for use in various applications; a TDM representation followed by Non-Negative Matrix Factorization (NMF) achieves this [16]. Mathematically it can be represented as,

    A ≈ W H^T                                                      (4)

If A is the original TDM, then W is the i × r basis matrix and H is the j × r mixture matrix. A linear combination of the basis vectors (column vectors) of W, weighted by H, gives the approximation of the original matrix A. While factorizing, random values are initially assigned to W and H, and an optimization function is then applied to compute appropriate W and H:

    min f_r(W, H) ≡ || A − W H^T ||_F^2,   s.t. W, H ≥ 0           (5)

Here F denotes the Frobenius norm and r is the dimension-reduction parameter, which is set to 10 in order to have an i × 10 fixed-size vector for each question.

NMF is used for finding the basis vectors for the following reasons: the non-negativity constraints make interpretability more straightforward than in other factorization methods; the selection of r is straightforward; and the basis vectors in the semantic space are not constrained to be orthogonal, which cannot be achieved by finding singular vectors or eigenvectors [8].
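As an illustration of Equations (3)-(5), the sketch below factorizes a stand-in matrix with scikit-learn's NMF. In the actual system r = 10 and A is the TDM described above; here a small random non-negative matrix and r = 2 are used purely for demonstration.

    # Minimal sketch of Eq. (3)-(5), assuming scikit-learn and NumPy.
    # In the paper r = 10; r = 2 is used here only because the toy matrix is tiny.
    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    A = rng.random((6, 12))          # stand-in for the questions x terms TDM

    r = 2
    nmf = NMF(n_components=r, init="random", random_state=0, max_iter=500)
    W = nmf.fit_transform(A)         # i x r basis / representation matrix
    H = nmf.components_              # r x j mixture matrix

    # Reconstruction error corresponds to the Frobenius norm in Eq. (5).
    print(W.shape, H.shape)
    print(np.linalg.norm(A - W @ H, ord="fro"))

Note that scikit-learn returns the mixture matrix already as an r × j array, so the reconstruction is W @ H rather than W H^T as written in Equation (4).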
3. QUESTION CLASSIFICATION
Question Answering (QA) systems are becoming necessary units in industrial as well as non-industrial domains. In particular, they try to automate the manual effort required in personal assistance systems and virtual agents. With this motivation, the remaining part of this section describes the proposed approach to the question classification task.

For this experiment the data set was provided by the Mixed Script Information Retrieval (MSIR) task committee [3, 2]. Detailed statistics about the training and testing sets are given in Table 1.

    Description                          Train file    Test file
    # Questions                               330          180
    Total # Unique Words                      407          282
    # Words after Filtering                   220          133
    Average # Words per Question              0.67         0.74
    # Bi-Grams after Filtering                207          118
    Average # Bi-Grams per Question           0.63         0.66
    # Tri-Grams after Filtering                92           55
    Average # Tri-Grams per Question          0.28         0.31
    Total # Features                          519          306

    Table 1: Data-set Statistics

The objective of the task is to classify a given question into its corresponding class. The distributional representations of the given training and testing corpora are computed as described in the previous section. The systematic diagram for the remaining approach is given in Figure 2.

Figure 2: Model Diagram of Proposed Approach

After the representation, the reference vector for each class is computed by summing up the question vectors in that class. This reference vector acts as the entropy vector for its corresponding class. Mathematically,

    r_c = Σ_{q_i ∈ c} q_i                                          (6)

    R = r_{c1}, r_{c2}, ..., r_{cn}                                (7)

Then the similarity measures between each question vector q_i and the reference vectors in R are computed. The computed similarity measures are given in Table 2; they are taken as the attributes for the supervised classification algorithm.

    Measured Feature Functions

    Similarity (Dot Product):      P^T Q
    Euclidean Distance:            √( Σ_{i=1}^{d} |P_i − Q_i|^2 )
    Bray-Curtis Dissimilarity:     Σ_{i=1}^{d} |P_i − Q_i| / Σ_{i=1}^{d} (P_i + Q_i)
    Chebyshev Distance:            max_i |P_i − Q_i|
    Correlation:                   Σ_{i=1}^{d} (P_i − Q_i)^2 / Q_i

    Table 2: Measured Similarity Features
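Putting Equations (6)-(7) and the Table 2 features together, the following sketch computes a reference vector per class, derives the five similarity features for every question against every reference vector, and trains a supervised classifier on them (a Random Forest, as used in the run described below). It assumes scikit-learn and SciPy; the random matrices, toy labels and hyper-parameters are illustrative placeholders, not the values used in the submitted runs.

    # Sketch of the classification stage: class reference vectors (Eq. 6-7),
    # the Table 2 similarity features, and a Random Forest classifier.
    # Assumes scikit-learn and SciPy; all names and toy data are illustrative.
    import numpy as np
    from scipy.spatial import distance
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    W_train = rng.random((40, 10))            # 40 question vectors, r = 10
    y_train = rng.integers(0, 3, size=40)     # 3 toy classes

    classes = np.unique(y_train)
    # Eq. (6): reference ("entropy") vector of a class = sum of its question vectors.
    ref = {c: W_train[y_train == c].sum(axis=0) for c in classes}

    def similarity_features(q):
        """Table 2 features of question vector q against every reference vector."""
        feats = []
        for c in classes:
            r_c = ref[c]
            feats += [
                float(q @ r_c),                                 # dot product
                distance.euclidean(q, r_c),                     # Euclidean distance
                distance.braycurtis(q, r_c),                    # Bray-Curtis dissimilarity
                distance.chebyshev(q, r_c),                     # Chebyshev distance
                float(np.sum((q - r_c) ** 2 / (r_c + 1e-9))),   # "Correlation" term of Table 2
            ]
        return feats

    X_train = np.array([similarity_features(q) for q in W_train])
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    print(clf.predict(X_train[:5]))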
The Random Forest Tree (RFT) with nC√n trees is used to perform the supervised classification. To assess the performance, 10-fold cross-validation is performed during training, which yields 82% precision. The proposed approach yields 79.44% accuracy on the test set, and the statistics about the results are tabulated in Table 3. In total, 3 runs, differing in the final classification algorithm, were submitted to the task committee; this paper describes the approach that yields the best performance.

    Team               Accuracy    PER     LOC     ORG     NUM     TEMP    MONEY   DIST    OBJ     MISC
    AmritaCEN             80.55    0.82    0.81    0.56    0.92    1.00    0.77    0.97    0.57    NA
    AMRITA-CEN-NLP        79.44    0.80    0.89    0.60    0.85    0.97    0.72    0.95    0.58    NA
    Anuj                  81.11    0.81    0.81    0.55    0.98    0.97    0.96    0.95    0.50    NA
    BITS PILANI           81.11    0.72    0.74    0.74    0.96    0.92    0.94    0.95    0.50    0.20
    IINTU                 83.33    0.80    0.89    0.65    0.94    1.00    0.81    0.97    0.53    NA
    IIT(ISM)D*            80.00    0.77    0.89    0.61    0.89    0.94    0.76    0.95    0.58    NA
    NLP-NITMZ             78.88    0.85    0.81    0.70    0.89    0.95    0.74    0.92    0.33    0.20

    Table 3: Results


4. INFORMATION RETRIEVAL
The amount of information shared through social media is huge, and its representation poses various challenges. This motivates research on obtaining useful information out of it. IR is part of such research; it is a basic component of text analytics and serves as an input to other applications. One of the major problems is handling transliterated texts in IR: transliterated texts make the representation problem considerably harder.

For this experiment the data set was provided by the Mixed Script Information Retrieval (MSIR) task committee [3]. Detailed statistics about the training and testing sets are given in Table 4.

    Description                          Train file    Test file
    # Questions                             11238         5090
    Total # Words                           19654        11994
    # Words after Filtering                 13616         6756
    Average # Words per Question             1.21          1.32
    # Bi-Grams after Filtering              46673        16494
    Average # Bi-Grams per Question          4.15          3.24
    # Tri-Grams after Filtering             56513        11961
    Average # Tri-Grams per Question         5.03          2.35
    Total # Features                       116802        35211

    Table 4: Data-set Statistics

The objective of this task is to retrieve the top 20 relevant tweets from the corpus with respect to the input query. First, the queries and the corpus are distributionally represented as described in Section 2.

    Q = q_1, q_2, q_3, ..., q_n                                    (8)

    T = t_1, t_2, ..., t_n                                         (9)

Then the cosine distance between the query and the corpus vectors is calculated to retrieve the top 20 tweets with minimum distance. Mathematically it is expressed as,

    similarity = Σ_{i=1}^{n} q_i t_i / ( √(Σ_{i=1}^{n} q_i^2) √(Σ_{i=1}^{n} t_i^2) )    (10)

    distance = cos^{-1}(similarity) / π                            (11)
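The sketch below illustrates Equations (10)-(11): cosine similarity between the query vector and each tweet vector is converted into an angular distance, and the 20 tweets with minimum distance are returned. It assumes scikit-learn; the random matrices stand in for the distributional representations of Section 2, and all names are illustrative.

    # Minimal sketch of Eq. (10)-(11), assuming scikit-learn and NumPy.
    # Random matrices stand in for the NMF representations of Section 2.
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    rng = np.random.default_rng(0)
    tweets = rng.random((500, 10))            # 500 tweet vectors, r = 10
    query = rng.random((1, 10))               # a single query vector

    sim = cosine_similarity(query, tweets)[0]     # Eq. (10)
    sim = np.clip(sim, -1.0, 1.0)                 # guard the arccos domain
    dist = np.arccos(sim) / np.pi                 # Eq. (11)

    top20 = np.argsort(dist)[:20]             # indices of the 20 closest tweets
    print(top20)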
The proposed distributional representation based approach yields 0.0377 mean average precision on the test queries, which is the best among the approaches proposed for this task [2]. The statistics about the obtained results are given in Table 5.

    Team            Mean Average Precision
    UB                      0.0217
    Anuj                    0.0209
    Amrita CEN              0.0377
    NLP-NITMZ               0.0203
    NITA NITMZ              0.0047
    CEN@Amrita              0.0315
    IIT(ISM)D               0.0083

    Table 5: Results

5. CONCLUSION
The classification and retrieval tasks were addressed using the distributional representation of the text, built from the term - document matrix and non-negative matrix factorization. The proposed approach performed well in both tasks, but there is still room for improvement. Though the distributional representation methods performed well, they suffer from the well-known 'curse of dimensionality'. Much research is still needed in feature engineering, which would directly reduce the dimension of the term - document matrix. Hence future work will focus on improving retrieval performance and reducing the dimensionality of the representation basis vectors.

6. REFERENCES
[1] C. C. Aggarwal and C. Zhai. A survey of text classification algorithms. In Mining Text Data, pages 163–222, 2012.
[2] S. Banerjee, S. Naskar, P. Rosso, S. Bandyopadhyay, K. Chakma, A. Das, and M. Choudhury. MSIR@FIRE: Overview of the mixed script information retrieval. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[3] S. Banerjee, S. K. Naskar, P. Rosso, and S. Bandyopadhyay. The first cross-script code-mixed question answering corpus. In Modelling, Learning and Mining for Cross/Multilinguality Workshop, pages 56–65, 2016.
[4] H. B. Barathi Ganesh, M. Anand Kumar, and K. P. Soman. Amrita CEN at SemEval-2016 Task 1: Semantic relation from word embeddings in higher dimension. In Proceedings of SemEval-2016, pages 706–711, 2016.
 [5] W. Blacoe and M. Lapata. A comparison of
     vector-based representations for semantic composition.
     In Proceedings of the 2012 Joint Conference on
     Empirical Methods in Natural Language Processing
     and Computational Natural Language Learning, pages
     546–556, 2012.
 [6] S.-H. Cha. Comprehensive survey on
     distance/similarity measures between probability
     density functions. City, 1, 2007.
 [7] J. Ramos. Using TF-IDF to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning, 2003.
 [8] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.
 [9] A. Manwar, H. Mahalle, K. Chinchkhede, and
     V. Chavan. A vector space model for information
     retrieval: A matlab approach. Indian Journal of
     Computer Science and Engineering, 3:222–229, 2012.
[10] R. Pat. An introduction to latent semantic analysis.
     Indian Journal of Computer Science and Engineering.
[11] U. Reshma, H. B. Barathi Ganesh, and
     M. Anand Kumar. Author identification based on
     word distribution in word space. In Advances in
     Computing, Communications and Informatics
     (ICACCI), pages 1519–1523, 2015.
[12] A. Roshdi and A. Roohparvar. Review: Information
     retrieval techniques and applications.
[13] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18:613–620, 1975.
[14] R. Socher, E. Huang, J. Pennington, C. Manning, and
     A. Ng. Dynamic pooling and unfolding recursive
     autoencoders for paraphrase detection. Advances in
     Neural Information Processing Systems, pages
     801–809, 2011.
[15] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267–273, 2003.
[16] Y. Ye. Comparing matrix methods in text-based
     information retrieval. Tech. Rep., School of
     Mathematical Sciences, Peking University, 2000.