
Multi-Arm Active Transfer Learning for Telugu Sentiment Analysis

Subba Reddy Oota (oota.subba@students.iiit.ac.in), International Institute of Information Technology, Hyderabad

Vijaysaradhi Indurthi

Mounika Marreddy (mounika0559@gmail.com), Quadratyx, Hyderabad

Sandeep Sricharan Mukku (sandeep.mukku@research.iiit.ac.in), International Institute of Information Technology, Hyderabad

Radhika Mamidi (radhika.mamidi@iiit.ac.in), International Institute of Information Technology, Hyderabad


Transfer learning algorithms can be used when a sufficient amount of training data is available in the source domain and only limited training data is available in the target domain. Transferring knowledge from one domain to another requires the two domains to be similar. For many resource-poor languages, labeled training data is rarely available in both the source and target domains. Active learning algorithms, which query additional labels from an oracle, can be used effectively to train on the source domain when an oracle is available for the source domain but not for the target domain. However, active learning strategies are subjective because they are designed by humans: designing a strategy can be time-consuming, and the result can vary from one designer to another. To tackle these problems, we design a learning algorithm that connects transfer learning and active learning with the well-known multi-armed bandit problem by querying the most valuable information from the source domain. The advantage of our method is that it combines the best active query selection, obtained through multi-arm active learning, with distribution matching between the two domains through transfer learning. The effectiveness of the proposed method is validated by experiments on three domain-specific Telugu datasets for sentiment analysis.

Keywords: Active Learning, Transfer Learning, Multi-Arm Bandit

People comment on online reviews and blog posts in social media about trending activities in their regional languages. Many tools, resources, and corpora are available to analyze these activities for the English language. However, few tools and resources exist for resource-poor languages like Telugu. Given the dearth of annotated sentiment data in Telugu, the existing labeled datasets need to be enlarged across different domains. However, manually annotating abundant unlabeled data is time-consuming, costly, and resource-intensive.

To address the above problems, we propose a Multi-Arm Active Transfer Learning (MATL) algorithm, which combines transfer learning [1] with a combination of query selection strategies from active learning [3]. One prerequisite for transfer learning is that the source and target domains are closely related. We use Maximum Mean Discrepancy (MMD) [2] to measure the closeness between the distributions of the source and target domains. In this paper, we experiment with sentiment analysis on three domain-specific Telugu datasets: Movies (M), Political (P), and Sports (S) (see footnote 1). Treating each domain in turn as the source or the target gives a total of 6 domain pairs: M-P, M-S, P-M, P-S, S-M, S-P. Figure 1 shows the results for two of these domain pairs, P-S and S-P. We record the accuracy obtained with three classification techniques, namely support vector machines (SVM), extreme gradient boosting (XGBoost), and gradient boosted trees (GBT), as well as with meta learning over all of these approaches.
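As a rough illustration of how MMD can score the closeness of two domains, the sketch below computes the biased estimate of squared MMD between two small samples of feature vectors. The RBF kernel, the bandwidth parameter `gamma`, and the function names are assumptions made for this example; the text above only names MMD [2] as the measure and does not specify a kernel or estimator.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors.

    The kernel choice and `gamma` are assumptions for this sketch.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of squared MMD between two samples.

    MMD^2 = E[k(s, s')] + E[k(t, t')] - 2 E[k(s, t)],
    with each expectation replaced by a sample mean.
    """
    return (
        mean_kernel(source, source, gamma)
        + mean_kernel(target, target, gamma)
        - 2.0 * mean_kernel(source, target, gamma)
    )
```

A smaller value suggests that the two samples come from more similar distributions, which is the sense in which a source domain would be considered "close" to a target domain here.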

Approach & Results. The Multi-Arm active transfer learning approach takes as input the source domain S = {unlabeled data instances (S_U), labeled data instances (S_L)}, the target domain T = {unlabeled data instances (T_U), labeled data instances (T_L), test data instances (T_T), used for measuring classification accuracy at each iteration}, and the number of iterations (n). A decision-making model is built along with this approach to predict the posterior probability for each instance of S_U. After the sampling query distribution φ(S(n)) is calculated, the multi-arm bandit approach selects the best sample instance x_{i_n} ∈ S for querying. If x_{i_n} ∈ S_U, the selected instance is labeled by an oracle/labeler as y_{i_n} and added to S_L. The classifier (C_n) is then trained on the combined set {updated S_L, T_L}, and the distance between the source and target distributions is calculated using MMD [2]. This process is repeated until the query budget is reached. The classification model C_n is tested on the target test data T_T to measure the accuracy. The reward r_n(a_k(n)) and the observation o_n(a_k(n)) are updated by comparing the label y_{i_n} given by the oracle/labeler with the classifier's prediction C_n(x_{i_n}).
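The query loop described above can be sketched roughly as follows. This is not the authors' implementation. The text does not name the bandit algorithm or the exact form of the reward, so the sketch assumes an EXP3-style bandit and a reward of 1 when the retrained classifier agrees with the oracle label. The toy 1-D threshold classifier, the two arms, and all function names are hypothetical. The sketch queries only unlabeled source instances and omits the MMD computation and the evaluation on T_T.

```python
import math
import random


class Exp3Bandit:
    """EXP3-style bandit; each arm stands for one query selection strategy."""

    def __init__(self, n_arms, gamma=0.1, seed=0):
        self.n_arms = n_arms
        self.gamma = gamma  # exploration rate (assumed value)
        self.weights = [1.0] * n_arms
        self.rng = random.Random(seed)

    def probabilities(self):
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select_arm(self):
        probs = self.probabilities()
        return self.rng.choices(range(self.n_arms), weights=probs, k=1)[0]

    def update(self, arm, reward):
        # Importance-weighted reward keeps the estimate unbiased.
        p = self.probabilities()[arm]
        self.weights[arm] *= math.exp(self.gamma * (reward / p) / self.n_arms)


def run_query_loop(source_unlabeled, oracle, strategies, train, budget,
                   target_labeled=(), seed=0):
    """Pick an arm, query one source instance, retrain, then reward the arm."""
    bandit = Exp3Bandit(len(strategies), seed=seed)
    unlabeled = list(source_unlabeled)
    source_labeled = []
    score = train(list(target_labeled))
    for _ in range(budget):
        if not unlabeled:
            break
        arm = bandit.select_arm()
        x = strategies[arm](score, unlabeled)
        unlabeled.remove(x)
        y = oracle(x)                        # label from the oracle/labeler
        source_labeled.append((x, y))
        score = train(source_labeled + list(target_labeled))
        predicted = 1 if score(x) > 0 else 0
        reward = 1.0 if predicted == y else 0.0  # assumed reward definition
        bandit.update(arm, reward)
    return score, source_labeled, bandit


def train_threshold(labeled):
    """Toy classifier: score(x) = x - midpoint of the two class means."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    if pos and neg:
        threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    else:
        threshold = 0.5  # uninformed default when a class is missing
    return lambda x: x - threshold


def uncertainty_arm(score, unlabeled):
    """Query the instance closest to the current decision boundary."""
    return min(unlabeled, key=lambda x: abs(score(x)))


_arm_rng = random.Random(1)


def random_arm(score, unlabeled):
    """Query a uniformly random instance."""
    return _arm_rng.choice(unlabeled)
```

For example, `run_query_loop([i / 20 for i in range(21)], lambda x: 1 if x > 0.6 else 0, [uncertainty_arm, random_arm], train_threshold, budget=10)` runs ten queries on a toy pool and returns the final score function, the labeled pairs, and the bandit, whose `probabilities()` show how much weight each strategy has earned.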

[Figure 1: Accuracy (%) versus number of queried instances for the domain pairs (a) P-S and (b) S-P. Methods compared: Uncertainty Sampling, Random Sampling, QUIRE, QBC, DWUS, and MATL.]

1. Gong, B.: Discriminatively learning domain-invariant features for unsupervised domain adaptation (2013)

2. Gretton, A., Smola, A.J.: A kernel method for the two-sample-problem (2007)

3. Settles, B.: Active learning literature survey. Tech. rep. (2010)

Footnote 1: https://github.com/subbareddy248/Datasets/tree/master