<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Statistical Semantics in Context Space: Amrita CEN@Author Profiling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Barathi Ganesh HB</string-name>
          <email>barathiganesh.hb@tcs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anand Kumar M</string-name>
          <email>m_anandkumar@cb.amrita.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soman KP</string-name>
          <email>kp_soman@amrita.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Artificial Intelligence Practice, Tata Consultancy Services</institution>
          ,
          <addr-line>Kochi - 682 042, Kerala</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Center for Computational Engineering and Networking, Amrita School of Engineering</institution>
          ,
          <addr-line>Coimbatore</addr-line>
          ,
          <institution>Amrita Vishwa Vidyapeetham, Amrita University</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The language people share differs due to diversity in their ethnicity, socioeconomic status, gender, religion, sexual orientation, geographical area, accent, pronunciation and word usage. This leads to the hypothesis that shared language follows an unknown hidden pattern. Under this hypothesis, determining attributes of a person such as age, gender, personality and nativity has multiple applications in social media, forensic science, marketing analysis, e-commerce and e-security. This work advances research on author profiling by overcoming existing language-dependent, domain-dependent and lexicon-based author profiling methods, finding a user's sociolect aspects from the author's statistical pattern of semantics in context space. The method proves to be domain and language independent, achieving near-constant performance over the English, Dutch and Spanish corpora.</p>
      </abstract>
      <kwd-group>
<kwd>Author Profiling</kwd>
        <kwd>Context Space</kwd>
        <kwd>Distributional Representation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>
        The amount of language shared through the Internet is growing rapidly with social media resources like Facebook, Twitter, LinkedIn and Pinterest and chat resources like Hike, WhatsApp and WeChat [<xref ref-type="bibr" rid="ref1">1</xref>]. This growth encourages recommendation and Internet marketing among the users of a particular resource, and business organizations use it for marketing, market analysis, advertising and connecting with customers [<xref ref-type="bibr" rid="ref2">2</xref>]. This creates the need for Author Profiling (AP) [15] to discover a user's sociolect aspects from shared language. The complication is that, unlike other natural language text, the language people share on social media is short, which makes it hard to extract information from it.
      </p>
<p>
        People have engaged with authorship tasks since the times of the ancient Greek playwrights: recognizing the age, gender, native language, personality and many more facets that frame the profile of a particular person. It finds application in different areas such as forensic security, literary research, marketing analysis, industry, online messengers, e-commerce, chats in mobile applications and medical applications to treat neuroticism [<xref ref-type="bibr" rid="ref2">2</xref>]. Forensic linguistics came into existence only after 1968. In this sector, the police register is one of the areas under security, in which statements taken down by the police act as source texts for Author Profiling (AP), and legal investigation continues its examination across all fields of suspicion.
      </p>
<p>
        In marketing, online customer reviews in blogs and sites help consumers decide what to buy. Detecting the age and gender of the person who posted feedback paves the way for owners to improve their business strategy [<xref ref-type="bibr" rid="ref2">2</xref>]. Industries benefit from customers' suggestions and reviews, from which they can group the most likely products based on gender and age. Twitter and Facebook are the most popular social media sites. A survey from last year shows that about 236 million users sign up every month to the micro-blogging site Twitter and 1.44 billion to Facebook, but among them 83.09 million are fake accounts [<xref ref-type="bibr" rid="ref1">1</xref>]. Authors under the age of 13 and authors holding more than one account are noted as fake accounts that have to be taken care of. There may also be anonymous users who maintain many fake IDs and post messages and chat with innocent people in order to trap them.
      </p>
<p>
        In general, a Machine Learning (ML) algorithm can attain this objective if given relevant features, and most existing methods follow this approach [<xref ref-type="bibr" rid="ref3">3</xref>][<xref ref-type="bibr" rid="ref4">4</xref>]. The most commonly used features for AP are the author's style-based features (punctuation marks, usage of capitals, POS tags, sentence length, repeated usage of words, quotations), content-based features (topic-related words, words present in dictionaries), content and typographical ease, words that express sentiments and emotions together with emoticons, special words from which information can be extracted, collocations and n-grams. These features depend on a lexicon that varies with topic, genre and language. In ML, low-dimensional condensed vectors exhibiting the relation between terms, documents and the profile were built using Concise Semantic Analysis (CSA) to create second-order attributes (SOA), which were classified using a linear model that was sensitive to high-dimensional problems. This system was extended in 2014 to make profiling more precise: by generating highly informative attributes (creating sub-profiles) with the Expectation Maximization Clustering (EMC) algorithm, the extended system was able to group sub-classes within a cluster and exhibit relations between sub-profiles. Though this system was successful, it remained dependent on language and genre [<xref ref-type="bibr" rid="ref3">3</xref>][<xref ref-type="bibr" rid="ref4">4</xref>].
      </p>
<p>
        The syntactic and lexical features utilized in earlier models vary with the morphological and agglutinative nature of the language, and also with the domain in which AP is performed. This makes it difficult for classifying algorithms to learn from these features a unified and effective classification model that is independent of domain and language, as can be observed from system performance in the PAN-AP shared tasks [<xref ref-type="bibr" rid="ref3">3</xref>][<xref ref-type="bibr" rid="ref4">4</xref>][<xref ref-type="bibr" rid="ref5">5</xref>].
      </p>
      </p>
<p>
        To overcome these conflicts, this paper proposes a model based on statistical semantics extracted from an author's digital text. Statistical semantics advances research on relational similarity by including statistical features of word distribution along with the traditional semantic features utilized in Latent Semantic Analysis (LSA) [<xref ref-type="bibr" rid="ref6">6</xref>]. The sociolect aspects and vocabulary knowledge of a person vary due to human cognitive phenomena, which induce and also limit people of a particular gender and age group to a certain range of words to convey their message. By utilizing this word distribution in context space and its statistical features, the gender and age group of a particular author are identified in this work. The basic idea is to utilize the distributional representation of an author's document to aggregate the statistical semantic information and promote a set of constraints for finding related hypotheses about that author's document.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Works</title>
<p>
        Burger et al. collected a large number of tweets and also evaluated human performance using Amazon Mechanical Turk (AMT). Their data included 213 million multilingual tweets in total from 18.5 million users. As tweets include many other contents like emoticons and images, feature extraction was limited to a particular n-gram length, with 15,572,522 total distinct features. Word-level and character-level n-grams were chosen; no language-specific processing was done, and only n-gram counts were taken into account. Once features were extracted, the classifiers SVM, Naive Bayes and Winnow2 were evaluated, of which Winnow2 performed exceptionally well with an overall accuracy of 92%. Their work addressed only gender classification [<xref ref-type="bibr" rid="ref7">7</xref>].
      </p>
<p>
        Liao et al. observed that access to the colossal amount of user-generated data enables the study of lifetime semantic variation of individuals. The central premise of their model is that age impacts the topic mixture of a user, and every topic has a distinct age distribution. They used a Gibbs EM algorithm to estimate the model and were able to find both the word distribution and the age distribution from the sample of Twitter data they collected. They treated tweets as bag-of-words content, thus performing well and effectively mapping topics to ages [<xref ref-type="bibr" rid="ref8">8</xref>].
      </p>
      </p>
<p>
        López-Monroy et al. framed their methodology around the idea of second-order attributes (a low-dimensional and dense document representation), but went beyond consolidating information across each target profile. The proposed representation extended the analysis by fusing information among texts in the same profile; that is, they concentrated on sub-profiles. For this, they automatically discovered sub-profiles and built document vectors that represent more detailed relations between documents and sub-profile documents. The results show evidence of the usefulness of intra-profile information for determining gender and age profiles. The sub-profile or intra-profile information of each author was found using the Expectation Maximization Clustering (EMC) algorithm [<xref ref-type="bibr" rid="ref9">9</xref>].
      </p>
      </p>
<p>
        Maharjan et al. used the MapReduce programming paradigm for most parts of their processing pipeline, which makes their framework fast. Their framework uses word n-grams, including stopwords, punctuation and emoticons, as features and TF-IDF (term frequency-inverse document frequency) as the weighting scheme. These were fed to a logistic regression classifier that predicts the age and gender of the authors. MapReduce distributed the tasks among many machines and made the work easier and faster [<xref ref-type="bibr" rid="ref10">10</xref>].
      </p>
      </p>
<p>
        Unlike PAN 2016, in PAN 2013, PAN 2014 and PAN 2015 the training and testing were done on similar domains. In most of the work, authors' stylistic features, readability, domain-specific features (emoticons, hash tags), lexical features and LSA-based features, along with projection-based, regression-based and clustering-based classifiers, are used to achieve the objective, and most of the proposed systems vary in accuracy across different domains and languages [<xref ref-type="bibr" rid="ref3">3</xref>][<xref ref-type="bibr" rid="ref4">4</xref>][<xref ref-type="bibr" rid="ref5">5</xref>].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Mathematical Background</title>
<p>This section first presents the problem definition, followed by the mathematical modeling of the idea described in section 1 for building the AP model.</p>
      <sec id="sec-3-1">
<title>Problem Definition</title>
<p>In general, the solution is to build a training model from the given problem set pt = {d1, d2, ..., dm} and to map each document's author to a specific gender and age group, pt → (gender, age group).</p>
      </sec>
      <sec id="sec-3-2">
        <title>Training Phase</title>
<p>
          Step 1 - Construct the document-term matrix [Vi,j]m×n, where m is the total number of documents (the total number of authors) in pt and n is the size of the vocabulary [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], with
[Vi,j] = term frequency(vi,j), (1 ≤ i ≤ m) and (1 ≤ j ≤ n)
        </p>
<p>[V] = VSM(pt), (1 ≤ t ≤ m)</p>
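<p>Step 1 can be sketched as follows; this is a minimal illustration assuming scikit-learn's CountVectorizer as the VSM, with toy documents standing in for the PAN corpus.</p>

```python
# Build the document-term matrix [V]_{m x n} of raw term frequencies.
# Sketch only: CountVectorizer is assumed as the VSM; the documents are toy data.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "great phone love the camera",
    "camera quality is great",
    "battery life could be better",
]

vectorizer = CountVectorizer()
V = vectorizer.fit_transform(docs).toarray()  # one row per document (author)

m, n = V.shape  # m documents, n vocabulary terms
```

<p>Each row of V is one author's document, so V directly instantiates the problem set pt of the previous paragraph.</p>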
<p>
          Step 2 - The underlying semantic information and relations between authors' documents can be obtained as latent vectors by finding the basis vectors of VV^T, which is the column space of V. This column space is called the context space with respect to the author's documents. The computed basis vectors span the context space by satisfying the following condition [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ],
V ≈ WH^T
(3)
min ||V − WH^T||F², subject to W ≥ 0, H ≥ 0
(4)
        </p>
<p>
          In equation 3, W is the m × r basis matrix and H is the n × r coefficient matrix. A linear combination of the basis vectors (column vectors) of W with the coefficients of H gives the context matrix V. While factorizing, W and H are first assigned random values, and the optimization function in equation 4 is then applied to compute appropriate W and H, where r is the reduced dimension and F is the Frobenius norm. Here r is fixed as m to obtain an m × m context matrix. The basis vectors in W are considered the basis vectors of the context space, which are linearly combined with the elements of H to recompute V. Singular-vector and eigenvector based computation methods are avoided here, since they are constrained and forced to find orthogonal basis vectors, which may not form the exact context space of the author's documents. Since the occurrence count of a word in a document cannot be negative, which NMF accommodates, the non-negativity constraints make interpretability more straightforward than in other factorization methods [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
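<p>Step 2 can be sketched with scikit-learn's NMF applied to the document-document co-occurrence matrix; the matrix V below is a toy stand-in, and r is fixed to m as stated above.</p>

```python
# Factorize the document-document co-occurrence matrix with NMF.
# Sketch only: V is a random toy term-frequency matrix, not the paper's data.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.integers(0, 5, size=(4, 10)).astype(float)  # toy m x n matrix, m = 4

C = V @ V.T                      # m x m co-occurrence (context) matrix
r = C.shape[0]                   # r fixed to m, so W is m x m
model = NMF(n_components=r, init="random", random_state=0, max_iter=500)
W = model.fit_transform(C)       # non-negative basis vectors of the context space
H = model.components_            # non-negative coefficients; C is approx. W @ H
```

<p>Because C is non-negative by construction, the factorization keeps every basis vector interpretable as a (weighted) mixture of documents, which is the motivation given above for preferring NMF over SVD-style methods.</p>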
<p>
          Each element of the matrix W is a distributed representation of the semantic information of the author's documents in context space. This is known as the Vector Space Model of Semantics (VSMs) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], but in this application it captures the user's cognitive ability and will be called statistical semantics. Using these basis vectors it is possible to span the space in which the different representations of similar semantics lie.
        </p>
<p>[W]m×m = [x1, x2, ..., xm] (5)</p>
<p>[W] = VSMs([VV^T]) (6)</p>
<p>Step 3 - The statistical features of the semantic distribution in context space are computed in order to build a supervised classification model. The statistical features include the marginal decision boundaries with respect to the word distribution in each document vector Wi, based on each class to be classified. Performing NMF moves the values in W from discrete to continuous. Thus, by taking Wi as random variable 1 and fixing random variable 2 as one of several reference distributions (Normal, Gamma, Chi-square, Rayleigh and Pareto distributions), the correlation and null hypothesis between them are measured. This is expressed as,</p>
<p>[F]m×s = statistical features([W]m×m) (7)
Where s is the number of statistical features and F is the feature matrix for building the classification model. From the above it is clear that the extracted features depend only on how the author's semantic distribution lies in a document.</p>
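<p>One plausible reading of Step 3, sketched with scipy.stats: each basis vector Wi is compared against samples drawn from the named reference distributions, and the resulting correlations form that author's feature vector. The shape parameters chosen for the Gamma, Chi-square and Pareto distributions are illustrative assumptions, not values from the paper.</p>

```python
# Compare each row of W against reference distributions via correlation.
# Sketch only: W is a toy matrix and the distribution parameters are assumed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
W = rng.random((4, 50))  # toy stand-in for the m x m basis matrix

reference = {
    "normal":   stats.norm,
    "gamma":    stats.gamma(a=2.0),       # shape parameter assumed
    "chi2":     stats.chi2(df=3),         # degrees of freedom assumed
    "rayleigh": stats.rayleigh,
    "pareto":   stats.pareto(b=2.5),      # tail index assumed
}

features = []
for row in W:
    feats = []
    for dist in reference.values():
        sample = dist.rvs(size=row.size, random_state=0)
        # Pearson correlation of the sorted values measures distributional fit.
        r, _ = stats.pearsonr(np.sort(row), np.sort(sample))
        feats.append(r)
    features.append(feats)

F = np.array(features)  # [F]_{m x s}, s = number of reference distributions
```

<p>A hypothesis-test statistic (e.g. a Kolmogorov-Smirnov p-value) could be appended per distribution in the same loop to capture the null-hypothesis measurements mentioned above.</p>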
<p>
          Step 4 - To build the classification model, the regression relation between the features and the respective class is constructed using the Random Forest tree algorithm, a collection of decision trees that formulates the classification rule based on randomly selected features in the training set. From L = {(yi, Fi); 1 ≤ i ≤ m}, the subsets Lb are formed and b aggregate predictors are built [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The final predictor is then built by,
φb(F) = argmax_J HJ
(8)
Where J ranges over the class labels voted by the decision trees and HJ = {φ(F, Lb) = J}.
        </p>
<p>The gender and age-group classification models are built using a hierarchical method. To constrain the model, the gender information, once found, is fed as an additional binary feature to the age-group classification model. In training, two models are therefore built, one for gender and one for age classification. This is further detailed in the following testing phase.</p>
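<p>The hierarchical training of Step 4 can be sketched with scikit-learn's RandomForestClassifier standing in for the Random Forest tree ensemble; the features and labels below are toy stand-ins.</p>

```python
# Train gender first, then feed its prediction as a binary feature to the
# age-group model. Sketch only: data and labels are randomly generated.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
F = rng.random((40, 5))                  # toy m x s statistical feature matrix
y_gender = rng.integers(0, 2, size=40)   # 0 = male, 1 = female (toy labels)
y_age = rng.integers(0, 3, size=40)      # toy age-group labels

model_gen = RandomForestClassifier(n_estimators=100, random_state=0).fit(F, y_gender)

# The gender output becomes one extra binary column for the age-group model.
gender_feat = model_gen.predict(F).reshape(-1, 1)
F_age = np.hstack([F, gender_feat])
model_age = RandomForestClassifier(n_estimators=100, random_state=0).fit(F_age, y_age)
```

<p>Using 100 trees matches the forest size reported in the experiment section; each tree votes and the majority class is returned by predict.</p>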
      </sec>
      <sec id="sec-3-3">
        <title>Testing Phase</title>
<p>Step 5 - Similar to the training set, except for Step 4, the test set pt = {d1, d2, ..., dn1} follows all remaining steps to compute feature vectors. Classification of a document into gender and age group is then performed with the b aggregate predictors in a hierarchical manner, and the final class is assigned by voting. The test features [Ft] are initially classified into male/female, and the prediction is padded on as an additional feature for the subsequent age-group classification.</p>
<p>The algorithm for training and testing is shown below,
Input: pt = {d1, d2, ..., dn}
for i = 1 to n do
    [V] = VSM(di)
end
[W] = NMF(VV^T)
[F] = statistical features([W])
modelgen = rft([Ffinal, b])
modelage = rft([Ffinal, gender])
ygen = predict(modelgen, [F])
yage = predict(modelage, [F, ygen])</p>
<p>Algorithm 1: Training and Testing</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiment and Observations</title>
<p>The model diagram for performing AP is given in Figure 1. The data-set chosen for this experimentation is from the PAN CLEF AP 2016 workshop [15][16], built with the challenges involved in real-world applications. The 2016 corpus uses Twitter data for training, while reviews and blog data of the authors are taken as test data. The corpus covers three languages (English, Dutch and Spanish). Among them, the Dutch data-set does not have age-group information.</p>
<p>During pre-processing, the author's text alone is extracted for further processing. As detailed in the problem definition section, the total documents (authors' tweets) are represented as a document-term matrix (m × n). This matrix is multiplied with its transpose to get the document-document co-occurrence matrix of size m × m. NMF is then applied on the document-document matrix to get the basis vector matrix (context matrix) with r = m. The basis vector of each author's document is considered a random variable, and its correlations with the reference distribution random variables mentioned in section 3 are measured as features. This final feature matrix is used to construct the classification model, built using Random Forest tree classifiers with 100 decision trees. Gender classification is performed first and its result is fed into the feature matrix, after which the age group is classified. The same process is applied to the three languages without any change. All of the above is done in Python and its packages (Scikit Learn and Scipy).
10-fold cross-validation is performed to measure the training performance, given in Table 1. The measures are reported on the individual (English, Dutch and Spanish) and combined (English and Spanish) data-sets. Though the proposed model is not superior in accuracy, the results show that it achieves near-constant accuracy over all the languages and genres. This indicates that the proposed model acts as a language and domain independent method.</p>
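<p>The 10-fold cross-validation described above can be sketched with scikit-learn's cross_val_score; the features and labels are toy stand-ins for the PAN data.</p>

```python
# Estimate training performance with 10-fold cross-validation.
# Sketch only: random toy features and gender labels replace the real corpus.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
F = rng.random((100, 5))                 # toy statistical feature matrix
y = rng.integers(0, 2, size=100)         # toy gender labels

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, F, y, cv=10)   # one accuracy per fold
mean_accuracy = scores.mean()
```

<p>Running the same snippet per language (and on the combined set) would reproduce the layout of Table 1.</p>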
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
<p>With the global need for author profiling systems, this experimentation has brought forth a simple, unified and reliable model for finding the demographic features of an individual by extracting the statistical semantics of context space. This is achieved by combining the document-term matrix, Non-negative Matrix Factorization and statistical features with the Random Forest tree classifier. From the results it can be concluded that this serves as a domain and language independent method; however, there is still room for improvement. Future work will extend and implement the proposed algorithm on distributed computation frameworks like Apache Hadoop and Apache Spark.
14. Leo Breiman: Random forests. Machine Learning, (2001)
15. Rangel, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B.: Evaluations Concerning Cross-genre Author Profiling. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2016)
16. Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF '14). pp. 268-299. Springer, Berlin Heidelberg New York (Sep 2014)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Perrin</surname>
          </string-name>
          : Social Media Usage:
          <fpage>2005</fpage>
          -
          <lpage>2015</lpage>
          . 2015
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Mangold</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Glynn</surname>
            , and
            <given-names>David J.</given-names>
          </string-name>
          <string-name>
            <surname>Faulds</surname>
          </string-name>
          :
          <article-title>Social media: The new hybrid element of the promotion mix</article-title>
          .
          <source>Business horizons</source>
          , (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Rangel</surname>
          </string-name>
          , Francisco, Efstathios Stamatatos, Moshe Moshe Koppel, Giacomo Inches, and
<article-title>Paolo Rosso: Overview of the author profiling task at PAN 2013</article-title>
          .
          <source>In CLEF Conference on Multilingual and Multimodal Information Access Evaluation</source>
          , (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Rangel</surname>
          </string-name>
          , Francisco, Paolo Rosso, Irina Chugur, Martin Potthast, Martin Trenkmann, Benno Stein, Ben Verhoeven, and
<article-title>Walter Daelemans: Overview of the 2nd author profiling task at PAN 2014</article-title>
          .
          <article-title>CLEF Evaluation Labs</article-title>
          and Workshop, (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Rangel</surname>
            , Francisco, P. Rosso,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Stein</surname>
            , and
            <given-names>W.</given-names>
          </string-name>
<article-title>Daelemans: Overview of the 3rd Author Profiling Task at PAN 2015</article-title>
          .
          <string-name>
            <surname>In</surname>
            <given-names>CLEF</given-names>
          </string-name>
          , (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Barathi</given-names>
            <surname>Ganesh</surname>
          </string-name>
          <string-name>
            <given-names>HB</given-names>
            ,
            <surname>Reshma</surname>
          </string-name>
          <string-name>
            <surname>U</surname>
          </string-name>
          , and Anand Kumar M:
<article-title>Author identification based on word distribution in word space</article-title>
          .
          <source>Advances in Computing, Communications and Informatics (ICACCI)</source>
          , (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Burger</surname>
            ,
            <given-names>John D.</given-names>
          </string-name>
          , John Henderson, George Kim, and Guido Zarrella:
          <article-title>Discriminating gender on Twitter</article-title>
          .
          <source>In Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Liao</surname>
          </string-name>
          , Lizi, Jing Jiang, Ying Ding,
          <article-title>Heyan Huang, and Ee Peng LIM: Lifetime lexical variation in social media</article-title>
          . (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
<surname>López-Monroy</surname>
          </string-name>
          ,
<article-title>Adrián Pastor, Manuel Montes-y-</article-title>
          <string-name>
<surname>Gómez</surname>
          </string-name>
          ,
<article-title>Hugo Jair Escalante, and Luis Villaseñor Pineda: Using Intra-Profile Information for Author Profiling</article-title>
          .
          <source>In CLEF (Working Notes)</source>
          , (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Maharjan</surname>
            , Suraj,
            <given-names>Prasha</given-names>
          </string-name>
          <string-name>
            <surname>Shrestha</surname>
          </string-name>
          , and
          <article-title>Thamar Solorio: A Simple Approach to Author Pro ling in MapReduce</article-title>
          .
          <source>In CLEF (Working Notes)</source>
          , (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Turney</surname>
          </string-name>
          ,
          <string-name>
            <surname>Peter D.</surname>
          </string-name>
          , and Patrick Pantel:
          <article-title>From frequency to meaning: Vector space models of semantics</article-title>
          .
<source>Journal of artificial intelligence research</source>
          , (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>Daniel D</given-names>
          </string-name>
          and Seung, H Sebastian:
          <article-title>Learning the parts of objects by nonnegative matrix factorization</article-title>
          .
          <source>Nature</source>
          Publishing Group, (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Xu</surname>
          </string-name>
          , Wei, Xin Liu, and Yihong Gong:
          <article-title>Document clustering based on non-negative matrix factorization</article-title>
          .
          <source>Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, ACM</source>
          , (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>