
Distributional Semantic Representation for Text Classification and Information Retrieval

Barathi Ganesh HB
Artificial Intelligence Practice, Tata Consultancy Services, Kochi - 682 042, India
barathiganesh.hb@tcs.com

Anand Kumar M and Soman KP
Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Amrita University, India
anandkumar@cb.amrita.edu, soman@.amrita.edu

The objective of this experiment is to validate the performance of the distributional semantic representation of text in a classification (Question Classification) task and an Information Retrieval task. Following the distributional representation, first-level classification of the questions is performed and the tweets relevant to the given queries are retrieved. The distributional representation of text is obtained by performing Non-Negative Matrix Factorization on top of the Document-Term Matrix of the training and test corpus. To improve the semantic representation of the text, phrases are considered along with the words. The proposed approach achieved an F-1 measure of 80% and a mean average precision of 0.0377 on the respective Mixed Script Information Retrieval task 1 and task 2 test sets.

Information Retrieval (IR) and text classification are classic applications in the text analytics domain, utilized in various forms across multiple domains and industries. Given a text content, the classifier must be capable of classifying it into a predefined set of classes, and given a query, the search engine must be capable of retrieving the relevant text content from the stored collection of text [1][12]. These tasks become more complex when the text contents are represented in more than one language, which introduces problems both during representation and while mining information out of it.

The fundamental component in classification and retrieval tasks is text representation, which represents the given text in an equivalent numerical form. These numerical components are then either used directly to perform further actions or used to extract the features required for them.

Text representation methods have evolved over time to improve the quality of the representation, moving from frequency based representation methods to semantic representation methods. Though other methods such as set-theoretic Boolean systems are also available, this paper focuses only on the Vector Space Model (VSM) and the Vector Space Model of Semantics (VSMs) [13].

In VSM, the text is represented as a vector based on the occurrence of terms (binary matrix) or the frequency of occurrence of terms (Term-Document Matrix, TDM) in the given text. Here, 'terms' represents words or phrases [9]. Considering only the term frequency is not sufficient, since it ignores the syntactic and semantic information that lies within the text.
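The two VSM variants can be illustrated with a minimal sketch. It assumes whitespace tokenization and lowercasing, which the paper does not specify; the function names and the example questions are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect the sorted set of unique whitespace-separated terms."""
    return sorted({term for doc in documents for term in doc.lower().split()})


def term_document_matrix(documents, vocabulary, binary=False):
    """Return one row per document and one column per vocabulary term."""
    matrix = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        row = [counts[term] for term in vocabulary]
        if binary:
            # Occurrence only: 1 if the term is present, 0 otherwise.
            row = [1 if count > 0 else 0 for count in row]
        matrix.append(row)
    return matrix


# Hypothetical example questions, not taken from the task data.
docs = ["who won the match", "when is the match", "who is the captain"]
vocab = build_vocabulary(docs)
tdm = term_document_matrix(docs, vocab)
binary_tdm = term_document_matrix(docs, vocab, binary=True)
```

Each row is the vector of one text. With short questions the count and binary variants often coincide, because a term rarely repeats within a single question.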

The term-document matrix is inefficient due to the biasing problem (i.e. a few terms get higher weights because of unbalanced and uninformative data). To overcome this, the Term Frequency - Inverse Document Frequency (TF-IDF) representation was introduced, which re-weighs the terms based on their presence across the documents [7]. It tends to give higher weights to rarely occurring words, and these words may be misspellings, which are common in social media texts.
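The tendency to favour rare (possibly misspelled) words can be seen in a small sketch. It assumes the plain variant idf = log(N / df); other TF-IDF variants exist, and the column labels are hypothetical.

```python
import math


def tf_idf(matrix):
    """Re-weight a document-by-term count matrix with idf = log(N / df)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]


# Hypothetical columns: "match", "the", "mach" (a misspelling seen once).
counts = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
]
weighted = tf_idf(counts)
```

In this toy matrix the term present in every document gets weight zero, while the misspelling that occurs in a single document gets the largest weight. That is the behaviour the paragraph above warns about.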

The Vector Space Model of Semantics (VSMs) overcomes the above mentioned shortcomings by weighing terms based on their context. This is achieved by applying matrix factorization methods such as Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NMF) to the TDM [10, 15, 11]. This makes it possible to weigh a term even though it is not present in a given query, because matrix factorization represents the TDM in terms of its basis vectors [5]. This representation does not include syntactic information, which requires large data and is computationally expensive because of its high dimension.

Word embeddings, along with the structure of the sentence, are utilized to represent short texts. This requires very little data and the dimension of the vector can be controlled. However, developing the Word to Vector (Word2Vec) model requires a very large corpus [14][4]. We do not consider it here since we do not have a large mixed script text data set. Following representation, similarity measures are computed in order to perform the question classification task. Here the similarity measures are distance measures (cosine distance, Euclidean distance, Jaccard distance, etc.) and a correlation measure (Pearson correlation coefficient) [6].
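The named measures can be sketched as follows. The exact definitions used in the experiments are not given in the paper, so these are the common textbook forms, with Jaccard distance taken over the sets of non-zero positions as an assumption.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def jaccard_distance(u, v):
    """Jaccard distance on the sets of non-zero positions (an assumption)."""
    su = {i for i, a in enumerate(u) if a}
    sv = {i for i, b in enumerate(v) if b}
    union = su | sv
    return 1.0 - len(su & sv) / len(union) if union else 0.0


def pearson_correlation(u, v):
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    std_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    std_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (std_u * std_v) if std_u and std_v else 0.0


# Two hypothetical vectors pointing in the same direction.
u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]
measures = {
    "cosine": cosine_distance(u, v),
    "euclidean": euclidean_distance(u, v),
    "jaccard": jaccard_distance(u, v),
    "pearson": pearson_correlation(u, v),
}
```

Because `v` is a scaled copy of `u`, the cosine and Jaccard distances are (numerically) zero and the Pearson correlation is one, while the Euclidean distance is not zero. This is why scale-insensitive measures are often preferred for texts of different lengths.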

Considering the above pros and cons, the proposed approach is evaluated to observe the performance of the distributional semantic representation of text in the classification and retrieval tasks. The given questions are represented as a TDM after the necessary preprocessing steps, and NMF is applied to it to get the distributional representation. Thereafter, distance and correlation measures between the entropy vector of each class and the vector representation of the question are computed in order to perform the question classification task. To retrieve the relevant tweets with respect to the given query, the cosine distance between the query and the tweets is measured.

2. DISTRIBUTIONAL REPRESENTATION

This section describes the distributional representation of the text, which is used further for the question classification and retrieval tasks. The systematic approach for the distributional representation is given in Figure 1.

2.1 Problem Definition

Let $Q = \{q_1, q_2, q_3, \ldots, q_n\}$ be the questions ($q_i$ represents the $i$th question) and $C = \{c_1, c_2, \ldots, c_n\}$ be the classes under which the questions fall, where $n$ is the size of the corpus. Let $T = \{t_1, t_2, \ldots, t_n\}$ be the tweets to which the questions are related, where $n$ is the size of the corpus. The objective of the experiment is to classify each question into its respective predefined class in task 1 and to retrieve the relevant tweets with respect to the input query in task 2.

2.2 Preprocessing

A few of the terms that appear across multiple classes create conflict in the classification; these terms generally get low weights in the TF-IDF representation. Hence, terms are eliminated if they occur across more than 3/4 of the classes, and, in order to avoid sparsity of the representation, terms with a document frequency of one are also eliminated. The TF-IDF representation itself is not considered here, because it tends to give high weights to rare words, which are more common in mixed script texts. Instead, the advantage of the TF-IDF representation is obtained indirectly by handling the document frequency of the terms.
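The two elimination rules can be sketched as below. This reads the 3/4 threshold as "the term occurs in more than three quarters of the classes", which is an interpretation and not confirmed by the paper; the function name, parameters and toy data are hypothetical.

```python
def filter_terms(questions, labels, max_class_ratio=0.75, min_df=2):
    """Return the terms kept after the two elimination rules.

    questions: list of token lists; labels: one class label per question.
    """
    n_classes = len(set(labels))
    doc_freq = {}
    term_classes = {}
    for tokens, label in zip(questions, labels):
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
            term_classes.setdefault(term, set()).add(label)
    kept = set()
    for term, df in doc_freq.items():
        class_spread = len(term_classes[term]) / n_classes
        # Drop terms seen in a single question or spread over too many classes.
        if df >= min_df and class_spread <= max_class_ratio:
            kept.add(term)
    return kept


# Hypothetical toy corpus with four classes.
questions = [
    ["who", "is", "the", "captain"],
    ["who", "won", "the", "match"],
    ["when", "is", "the", "match"],
    ["when", "was", "the", "final"],
    ["where", "is", "the", "stadium"],
    ["where", "was", "the", "toss"],
    ["how", "many", "runs", "in", "the", "match"],
]
labels = ["PERSON", "PERSON", "TIME", "TIME", "PLACE", "PLACE", "NUM"]
kept_terms = filter_terms(questions, labels)
```

In this toy corpus "the" is dropped because it appears in every class, and words such as "captain" are dropped because they occur in only one question.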

2.3 Vector Space Model: Term-Document Matrix

In the TDM, the vocabulary is computed by finding the unique words present in the given corpus. Then the number of times each term is present (term frequency) in each question is computed against the vocabulary formed. The terms present in this vocabulary act as first-level features.

$$A_{i,j} = \mathrm{TDM}(\mathrm{Corpus}) \tag{1}$$

$$A_i = \mathrm{termfrequency}(q_i) \tag{2}$$

Where $i$ represents the $i$th question and $j$ represents the $j$th term in the vocabulary. In order to improve the representation, bi-gram and tri-gram phrases are also considered along with the unigram words, after following the above mentioned preprocessing steps.

2.4 Vector Space Model of Semantics: Distributional Representation

NMF is applied to the TDM computed above to get the distributional representation of the given corpus.

$$W_{i,r} = \mathrm{nmf}(A_{i,j}) \tag{3}$$

In general, matrix factorization is done to get a product of matrices, subject to the constraint that the reconstruction error is low. The product components from the factorization give the characteristics of the original matrix [10, 15]. Here NMF is incorporated in the proposed model to get the principal characteristics of the matrix, known as basis vectors. Sentences may vary in length, but their representation needs to be of fixed size for use in various applications. TDM representation followed by Non-Negative Matrix Factorization (NMF) achieves this [16]. Mathematically it can be represented as,

$$A \approx W H^{T} \tag{4}$$

If $A$ is the $i \times j$ original TDM, then $W$ is the $i \times r$ basis matrix and $H$ is the $j \times r$ mixture matrix. Linear combination of the basis vectors (column vectors) of $W$ with the weights in $H$ gives the approximated original matrix $A$. While factorizing, random values are initially assigned to $W$ and $H$, and then the optimization function is applied to compute the appropriate $W$ and $H$.

$$\min f_r(W, H) \equiv \lVert A - W H^{T} \rVert_F^2 \quad \text{s.t. } W, H \geq 0 \tag{5}$$

Here $\lVert \cdot \rVert_F$ is the Frobenius norm and $r$ is the parameter for dimension reduction, which is set to 10 so that $W$ is of size $i \times 10$, i.e. a fixed size vector for each question.

NMF is used for finding the basis vectors for the following reasons: the non-negativity constraints make interpretation more straightforward than with other factorization methods; the selection of $r$ is straightforward; and the basis vectors in the semantic space are not constrained to be orthogonal, which is not possible when finding singular vectors or eigenvectors [8].
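The factorization can be sketched with the multiplicative update rules of Lee and Seung, which minimize the squared Frobenius error under non-negativity. The paper does not state which solver was used, so the update rule, the iteration count, the seed and the small `eps` safeguard are assumptions made for illustration; the toy matrix uses r = 2 rather than the r = 10 of the experiments.

```python
import random


def matmul(X, Y):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def frobenius_error(A, W, H):
    """Squared Frobenius norm of A - W H^T."""
    approx = matmul(W, transpose(H))
    return sum((a - b) ** 2 for ra, rb in zip(A, approx) for a, b in zip(ra, rb))


def nmf(A, r, iterations=200, seed=0, eps=1e-9):
    """Factorize A (i x j) into W (i x r) and H (j x r) with A ~ W H^T."""
    rng = random.Random(seed)
    n_rows, n_cols = len(A), len(A[0])
    # Random non-negative initialization, as described in the text.
    W = [[rng.random() + eps for _ in range(r)] for _ in range(n_rows)]
    H = [[rng.random() + eps for _ in range(r)] for _ in range(n_cols)]
    At = transpose(A)
    for _ in range(iterations):
        # H <- H * (A^T W) / (H W^T W), element-wise.
        num = matmul(At, W)
        den = matmul(H, matmul(transpose(W), W))
        H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (A H) / (W H^T H), element-wise.
        num = matmul(A, H)
        den = matmul(W, matmul(transpose(H), H))
        W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H


# Hypothetical 4 x 4 count matrix with two obvious groups of terms.
A = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 2, 2]]
W, H = nmf(A, r=2)
error = frobenius_error(A, W, H)
```

Each row of `W` is then the fixed-size representation of one question, whatever the vocabulary size. Because the updates only multiply by non-negative ratios, `W` and `H` stay non-negative throughout.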

3. QUESTION CLASSIFICATION

Question Answering (QA) systems are becoming necessary units in industrial as well as non-industrial domains. In particular, they try to automate the manual effort required in personal assistance systems and virtual agents. With this background, the remaining part of the section describes the proposed approach to the question classification task.

For this experiment the data set has been provided by the Mixed Script Information Retrieval (MSIR) task committee [3, 2]. The detailed statistics of the training and testing sets are given in Table 1.

The objective of the task is to classify the given question into its corresponding class. The distributional representations of the given training and testing corpus are computed as described in the previous section. The systematic diagram for the remaining approach is given in Figure 2.

After the representation, the reference vector for each class is computed by summing up the question vectors in that class. This reference vector acts as an entropy vector for its corresponding class. This is mathematically represented as,

$$r_c = \sum q_i \quad \text{s.t. } q_i \in c \tag{6}$$

Then the similarity measures between the question vector $q_i$ and the reference vectors in $R$ are computed. The similarity measures computed are given in Table 2. These similarity measures are taken as the attributes for the supervised classification algorithm.
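The reference vectors and the resulting attributes can be sketched as follows. Only cosine similarity is shown, as one assumed member of the measures listed in Table 2; the function names, class labels and vectors are hypothetical.

```python
import math


def class_reference_vectors(vectors, labels):
    """Sum the question vectors of each class: r_c = sum of q_i with q_i in c."""
    refs = {}
    for vec, label in zip(vectors, labels):
        if label not in refs:
            refs[label] = [0.0] * len(vec)
        refs[label] = [a + b for a, b in zip(refs[label], vec)]
    return refs


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_features(vec, refs):
    """One cosine-similarity attribute per class, in sorted class order."""
    return [cosine_similarity(vec, refs[label]) for label in sorted(refs)]


# Hypothetical 2-dimensional question vectors (e.g. rows of W) and labels.
vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 1.0]]
labels = ["PERSON", "PERSON", "TIME", "TIME"]
refs = class_reference_vectors(vectors, labels)
features = similarity_features([2.0, 0.0], refs)
```

The feature vector has one entry per class for each measure used, so its length depends on the number of classes and measures and not on the vocabulary size. These attributes are what the supervised classifier receives.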

A Random Forest Tree (RFT) with nCpn number of trees is utilized to perform the supervised classification. To assess the performance, 10-fold cross validation was performed during training, which yields a precision of 82%. The proposed approach yields an accuracy of 79.44% on the test set, and the statistics of the results are tabulated in Table 3. In total, 3 runs were submitted to the task committee, which differ in the final classification algorithm. In this paper we describe the approach that yields the best performance.

Table 3: Results of the question classification task. Teams listed: AmritaCEN, AMRITA-CEN-NLP, Anuj, BITS PILANI, IINTU, IIT(ISM)D*, NLP-NITMZ.

4. INFORMATION RETRIEVAL

The information shared through social media is huge, and its representation poses various challenges. This motivates research aimed at obtaining useful information out of it. IR is part of such research; it is a basic component in text analytics and serves as an input to other applications. One of the major problems is handling transliterated texts in IR. These transliterated texts introduce more complex problems, especially with representation.

For this experiment the data set has been provided by the Mixed Script Information Retrieval (MSIR) task committee [3]. The detailed statistics of the training and testing sets are given in Table 4.

The objective of this task is to retrieve the top 20 relevant tweets from the corpus with respect to the input query. First, the queries and the corpus are distributionally represented as described in section 3.
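Retrieval by cosine distance over such vectors can be sketched as below. The function names and the toy vectors are hypothetical, and ties are broken by corpus order, which the paper does not specify.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def retrieve(query_vec, tweet_vecs, top_k=20):
    """Return the indices of the top_k tweets closest to the query."""
    ranked = sorted(
        range(len(tweet_vecs)),
        key=lambda i: cosine_distance(query_vec, tweet_vecs[i]),
    )
    return ranked[:top_k]


# Hypothetical distributional vectors for one query and four tweets.
query = [1.0, 0.0, 1.0]
tweets = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 2.1],
]
top_two = retrieve(query, tweets, top_k=2)
full_ranking = retrieve(query, tweets)
```

With `top_k=20` the function returns at most 20 indices, matching the task's cut-off; shorter corpora simply return every tweet in ranked order.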

The proposed distributional representation based approach yields a mean average precision of 0.0377 on the test queries, which is the best among the approaches proposed in this task [2]. The statistics of the obtained results are given in Table 5.
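For reference, mean average precision can be computed as in the sketch below. It assumes binary relevance judgements and the standard definition of average precision; the exact evaluation protocol of the task is not described here, and the identifiers in the example are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k over the ranks k at which a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(rankings, relevance):
    """Mean of the per-query average precision values."""
    scores = [average_precision(r, rel) for r, rel in zip(rankings, relevance)]
    return sum(scores) / len(scores)


# Hypothetical rankings for two queries and their relevant tweet ids.
rankings = [["t3", "t1", "t7"], ["t2", "t5"]]
relevance = [{"t3", "t7"}, {"t5"}]
map_score = mean_average_precision(rankings, relevance)
```

Here the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so the mean is 2/3.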

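Table 5: Mean average precision of the teams participating in task 2. Teams listed: Anuj, Amrita CEN, NLP-NITMZ, NITA NITMZ, CEN@Amrita, IIT(ISM)D. Mean Average Precision values listed: 0.0217, 0.0209, 0.0377, 0.0203, 0.0047, 0.0315, 0.0083.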

The classification and retrieval tasks are developed based on the distributional representation of the text, utilizing the term-document matrix and non-negative matrix factorization. The proposed approach performed well in both tasks, but there is still room for improvement. Though the distributional representation methods performed well, they suffer from the well known 'Curse of Dimensionality' problem. This calls for further research in feature engineering that directly reduces the dimension of the term-document matrix. Hence, future work will focus on improving the performance of the retrieval and reducing the dimensionality of the representation basis vectors.

6. REFERENCES

[1] C. C. Aggarwal and Zhai. A survey of text classification algorithms. In Mining text data, pages 163-222, 2012.

[2] Banerjee, Naskar, Rosso, Bandyopadhyay, Chakma, A. Das, and M. Choudhury. MSIR@FIRE: Overview of the mixed script information retrieval. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.

[3] Banerjee, N. Sudip Kumar, Rosso, and Bandyopadhyay. The first cross-script code-mixed question answering corpus. In Modelling, Learning and mining for Cross/Multilinguality Workshop, pages 56-65, 2016.

[4] H. B. Barathi Ganesh, M. Anand Kumar, and K. P. Soman. Amrita CEN at SemEval-2016 task 1: Semantic relation from word embeddings in higher dimension. Proceedings of SemEval-2016, pages 706-711, 2016.

[5] Blacoe and Lapata. A comparison of vector-based representations for semantic composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 546-556, 2012.

[6] S.-H. Cha. Comprehensive survey on distance/similarity measures between probability density functions. City, 1, 2007.

[7] Juan. Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning, 2003.

[8] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. 1999.

[9] Manwar, Mahalle, Chinchkhede, and Chavan. A vector space model for information retrieval: A matlab approach. Indian Journal of Computer Science and Engineering, 3:222-229, 2012.

[10] Pat. An introduction to latent semantic analysis. Indian Journal of Computer Science and Engineering.

[11] Reshma, H. B. Barathi Ganesh, and M. Anand Kumar. Author identification based on word distribution in word space. In Advances in Computing, Communications and Informatics (ICACCI), pages 1519-1523, 2015.

[12] Roshdi and Roohparvar. Review: Information retrieval techniques and applications.

[13] Salton, Anita, and Chung-Shu. A vector space model for automatic indexing. Communications of the ACM, 18:613-620, 1975.

[14] Socher, Huang, Pennin, Manning, and Ng. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. Advances in Neural Information Processing Systems, pages 801-809, 2011.

[15] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 267-273, 2003.

[16] Ye. Comparing matrix methods in text-based information retrieval. Tech. Rep., School of Mathematical Sciences, Peking University, 2000.