User Comment Analysis for Android Apps and CSPI Detection with Comment Expansion

Lei Cen, Luo Si, Ninghui Li
Computer Science Department, Purdue University, West Lafayette, IN 47907, USA
{lcen, lsi, lin}@purdue.edu

Hongxia Jin
Samsung Information Systems America, San Jose, CA 95134, USA
hongxia.jin@sisa.samsung.com

ABSTRACT
The exponential growth of mobile app markets has been accompanied by serious public concern about security and privacy issues. User comments serve as a valuable source of information for evaluating a mobile app, for both new users and developers. However, for evaluating the security/privacy aspects of an app, user comments are not always directly useful: most comments concern issues such as functionality or missing features, or are pure emotional expression. Further effort is therefore required to identify the Comments with Security/Privacy Issues (CSPI) for later evaluation. In this paper, a dataset of comments is collected from Google Play, and a two-dimensional label system is proposed to describe the CSPI within it. A supervised multi-label learning method utilizing comment expansion is adopted to detect the different types of CSPI described by this label system. Experiments on the collected dataset show that the proposed method outperforms the same method without comment expansion.

1. INTRODUCTION
New challenges come with the exponentially growing markets of mobile apps. On one side, compared to traditional software markets, markets like Google Play and the Apple App Store have a lower entry threshold for developers and faster financial payback, which greatly encourages developers to invest in this thriving business. One result is a huge number of mobile apps of great diversity. Controlling the quality of apps, and especially their security risk, across whole markets therefore becomes an important issue for everyone involved. On the other side, public concern about privacy issues with online activity and mobile phones is also rising, demanding a mobile environment with more respect for users' privacy.

The infection rate of truly malicious mobile apps in a market can only be estimated; it is reported to be about 0.28% in [10]. The remaining apps are assumed benign, but they are not free of security issues. Many misbehaviors of a mobile app may lead to real security/privacy issues. For example, adding too many or inappropriate advertisements may lead to phishing and scareware; unresolved In-App Purchases (IAP) may lead to fraud; unnecessary background execution may be connected with unauthorized access to personal data.

User comments are valuable feedback for both new users and developers. Many security/privacy issues can be revealed by user comments, including those that may not be easy to discover from other sources (e.g., unresolved IAP issues). However, most negative comments are not necessarily Comments with Security/Privacy Issues (CSPI). Some are non-informative comments [4], consisting of vague statements and pure emotional expressions. Others may complain about the attractiveness, quality, or even cost of the app [5], which normally has nothing to do with security/privacy. To make use of the comments that reveal issues genuinely related to the security/privacy of an app, CSPI must be detected first, setting aside all the irrelevant comments. This paper provides a novel method for this problem. A set of comments is first collected and investigated. Based on observations from these comments, a two-dimensional label system is designed to describe CSPI from two different perspectives. With this label system, a CSPI Detection with Comment Expansion (CDCE) method is proposed, which first uses keyword-based filtering to narrow down the set of comments in question, and then applies a supervised multi-label learning method to identify the different types of CSPI.

The contributions of this paper are as follows:

• Instead of using a flat list of labels for different issues, this paper presents a two-dimensional label system, picturing the "What" and "When" of the occurrence of a reported CSPI and providing an additional dimension for understanding the relationship between different issues.

• A supervised learning method is proposed to solve this multi-label problem. Comment expansion is adopted to utilize the relationship between comments, and a post-process is used for the relationship between labels. Both relationships prove useful in improving CSPI detection performance on the collected dataset.

The rest of the paper proceeds as follows. Section 2 describes related work. Section 3 discusses the collection of the data and the design of the label system. Section 4 presents the method for CSPI detection. Section 5 demonstrates the evaluation of the proposed method. Section 6 lists the limitations of this paper together with future work, and Section 7 concludes the paper.
2. RELATED WORKS
User comments are utilized by several works [5, 6, 4] to evaluate mobile apps. One line of research [6] aims at extracting new or changed requirements for the next version of an app. It proposes the Aspect and Sentiment Unification Model (ASUM) to extract the topics of comments; ASUM extends Latent Dirichlet Allocation (LDA) [2] by incorporating both topic modeling and sentiment analysis. Another work [5] uses an LDA model to identify different topics among negative comments, in order to provide insight into why an app is getting low ratings. Our work focuses on a finer level, investigating only CSPI, which form a subset of the negative comments. Other researchers [4] propose a novel method for extracting informative review topics from user comments: a filtering process is applied first to remove non-informative comments, and LDA is then adopted to generate topics from the informative ones. Our work also requires a filtering process, but one based not on the quality of information in a comment, rather on whether the comment is related to security/privacy. All the works mentioned above rely on topic models (LDA), which are unsupervised. In our setting, the task is not to generate a general summary of the topics in an app's comments, but to identify a specific subset (CSPI) of all the comments and to further distinguish its different types. An unsupervised method like LDA may not produce the expected, precisely specified types of topics. Therefore, a supervised method is adopted in our work, and a label system is manually designed to give the learning process a precise task.

3. DATA COLLECTION
The data used in this work was collected by crawling the Google Play website. Information about 6,938 free apps on Google Play was downloaded from September through December 2013. The total number of comments collected for these 6,938 apps is 5,108,538; the average number of comments per app in this dataset is 736 and the maximum is 4,500. Figure 1a shows the distribution of the number of comments among the apps.

[Figure 1: A simple view of the collected dataset. (a) Histogram of the number of apps against the number of their comments. (b) Histogram of the number of comments with different ratings.]

The ratings attached to user comments are highly skewed, as shown in Figure 1b (this was previously also reported in [5]), indicating that most comments are not complaining about an issue at all. In addition, many of the comments with poor ratings (< 4.0) are not really about security/privacy issues: as observed from the dataset, complaints about functionality and quality (or attractiveness, for Game apps) form the majority. It is therefore necessary to narrow the huge set of comments down to a more feasible size for further analysis and annotation. To that end, a coarse filter is applied first to create a set of suspect comments. This filter is keyword-based: any comment that contains at least one of the predefined keywords is put into the suspect set. The keywords are picked manually in an iterative way. The initial keyword set is just {security, privacy}; new keywords are picked from the words with a high co-occurrence probability with the current keywords, and comments containing the co-occurring words are inspected manually to see how often they relate to security/privacy issues. The final keyword set includes security, privacy, permission, money, spam, steal, phish, etc. To avoid mismatches, different forms of the keywords are also considered, e.g., unsecure and insecure for security, and stole and stolen for steal. A rating filter is also applied to keep only comments with poor ratings (< 4.0). The resulting suspect set contains 36,464 comments from 3,174 apps.
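To make the filtering step concrete, the following is a minimal sketch of the coarse filter, assuming comments are held in memory as (text, rating) pairs; the keyword list shown contains only the keywords and spelling variants named above, not the authors' full set.

```python
import re

# Partial keyword set from the text above, plus the spelling variants it
# mentions; the authors' final list is larger ("security, privacy,
# permission, money, spam, steal, phish, etc.").
KEYWORDS = [
    "security", "unsecure", "insecure", "privacy", "permission",
    "money", "spam", "steal", "stole", "stolen", "phish",
]
KEYWORD_RE = re.compile("|".join(KEYWORDS), re.IGNORECASE)

def build_suspect_set(comments, rating_cutoff=4.0):
    """Keep comments that mention at least one keyword AND have a poor rating.

    `comments` is an iterable of (text, rating) pairs.
    """
    return [
        (text, rating)
        for text, rating in comments
        if rating < rating_cutoff and KEYWORD_RE.search(text)
    ]

# Example:
sample = [("This app steals my data!", 1.0), ("Great game, love it", 5.0)]
print(build_suspect_set(sample))  # [('This app steals my data!', 1.0)]
```

Substring matching deliberately catches inflected forms ("steals", "spamming"), mirroring the paper's inclusion of different keyword forms.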
3.1 CSPI Annotation
It is worth noting that the suspect set still contains much more than CSPI. Because of the simplicity of the keyword-based coarse filter, many comments in it are not CSPI, so the suspect set serves only as a base candidate set of comments suspected to be CSPI. Further effort is needed to actually detect CSPI, and the different types of CSPI need to be distinguished and treated accordingly.

A wide variety of issues appear among the comments in the suspect set. It would be exhausting to distinguish each of them without abstract concepts that account for the relationships between them; at the same time, it would be too vague to merely separate CSPI from non-CSPI comments without further insight into the actual issues. To trade off the complexity and functionality of the annotation system for describing CSPI, a label General is first defined for comments that are CSPI in general. In addition, a two-dimensional (Nature, Scenario) label set is defined, as shown in Table 1, to provide finer-grained identification of different CSPI topics. Among these labels, Execution is a super-label of Foreground and Background, and all labels are sub-labels of General.

The two dimensions of the label set capture the "What" (Nature) and the "When" (Scenario) of the underlying issues. The Nature dimension identifies the nature of the security/privacy-related misbehavior that the comment complains about; such behavior is normally described explicitly in the comment. The Scenario dimension identifies the scenario in which the issue occurs; this is sometimes expressed only implicitly. System, Privacy, and Finance are clearly necessary as direct issues. The label Spam is assigned to the advertising behavior widespread among apps. Although spamming is a popular and common behavior, it is closely related to security issues such as phishing and scareware, and is therefore also included as one type of issue covered by CSPI. The label Others is used for issues not covered by System, Privacy, Spam, or Finance. One example is the topic of apps that cannot be uninstalled normally: such an app is usually pre-installed by the vendor without any real issue, but it clearly violates the user's control over the phone and is complained about by many users. Another example of Others-labeled comments are claims that the app has been flagged by other security apps for some reason. Since these are not direct opinions of the users but only suspicions, they are put under Others rather than under one of the other four natures according to the reason the security app reportedly gave, which may not be true or may not even be stated explicitly in the comment. The label Before is mostly assigned together with Privacy, indicating complaints about the permissions required by the app, which are visible to the user before installation. The label After mostly covers email/SMS spam after uninstallation. The label Execution stands for the most common scenario and is further divided into two sub-labels: Foreground for issues that occur while the app runs in the foreground (occupying the screen) and Background for issues while it runs in the background. If the issue in a comment is not specific about foreground or background, only the label Execution is applied; otherwise both Foreground (or Background) and Execution are applied.

Dimension | Label      | Definition                                                   | Example issues
Nature    | System     | Issues causing negative effects on the system                | Freezing the phone, unauthorized downloading
Nature    | Privacy    | Issues about getting unauthorized access to user info        | Stealing phone number, accounts, emails
Nature    | Spam       | Issues about unpleasant ads and related behavior             | Annoying ads, spam shortcuts on the home screen
Nature    | Finance    | Issues about suspected money stealing                        | Unresolved purchases
Nature    | Others     | Security/privacy issues not included above                   | Uninstall issues
Scenario  | Before     | Issues occurring before installation of the app              | Requiring too many permissions
Scenario  | Execution  | Issues occurring while the app is executing on the phone     | General complaints about spam
Scenario  | Foreground | Issues occurring when the app occupies the foreground screen | Ads occupying too much screen, phishing
Scenario  | Background | Issues occurring while the app runs in the background        | Notification bar spam
Scenario  | After      | Issues occurring after uninstallation of the app             | Email/SMS spam after uninstall

Table 1: Two-dimensional label set
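The super-/sub-label structure above means an annotation can be closed upward before training, mirroring the rule in Section 5.1 that a comment labeled with a sub-label also counts as positive for its super-label. A minimal sketch of that closure, assuming labels are stored as plain strings:

```python
# Super-label implications from the label system: Execution covers
# Foreground and Background, and General covers every CSPI label.
SUPER = {
    "Foreground": "Execution",
    "Background": "Execution",
}

def close_labels(labels):
    """Add implied super-labels (Execution, then General) to a label set."""
    closed = set(labels)
    for label in list(closed):
        if label in SUPER:
            closed.add(SUPER[label])
    if closed:  # every CSPI label implies General
        closed.add("General")
    return closed

print(close_labels({"Spam", "Background"}))
# contains Spam, Background, Execution, General
```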
The suspect set was then annotated manually with these labels; each label was validated by the agreement of two annotators. Statistics of the suspect set with respect to the labels are shown in Table 2. About 30% of the comments in the suspect set are labeled as CSPI, which means that many comments, while mentioning some of the keywords, are not really about security/privacy issues. The sample comment for Whole Set in Table 2 is one of them: it mentions "stolen" not to claim a financial issue but to complain about a stolen idea, which is not about security or privacy. Some comments talk about "money" only to express that the app is not worth it (although all the apps we collected are free, people can still talk about In-App Purchases and the paid version of an app). Another example are the comments on security apps (anti-virus apps, password managers, etc.): many keywords appear in these comments because the apps themselves are about such issues, but the comments are not necessarily reporting an issue of the apps. Similar things happen with banking apps. The label Others has a very low percentage, indicating that the coverage of the four main natures of issues is good.

It is worth noting that misspelling may hinder detection performance; the "baught" in the sample for Finance is an example. Also important is the fact that a CSPI usually carries two or more labels besides General. For example, the sample comment for Spam is also annotated with Background, since it mentions that the spam appears in the notification bar, and the one for Before is also annotated with Privacy.

Label      | #      | %      | Sample comment
Whole Set  | 36,464 | 100%   | "Idea stolen from ***, and it feels unfinished, or unprofessional"
General    | 10,636 | 29.17% | all of the below
System     | 1,304  | 3.58%  | "It keeps crashing the phone. It becomes a device administrator."
Privacy    | 2,513  | 6.89%  | "Leaking GPS location to advertisers=Top 2 fastest ways to get me to hate your app/guts."
Spam       | 6,164  | 16.90% | "Since installing this app I get spam in my notifications constantly."
Finance    | 1,191  | 3.27%  | "I baught $2.00 in *** and I never got them. In really mad about it."
Others     | 191    | 0.52%  | "I buy my own phone with my own money and I cant delete this app from my phone?"
Before     | 1,611  | 4.40%  | "New versions permissions can steal all my data and contacts!!!"
Execution  | 8,456  | 23.19% | "No need for them to invade my privacy."
Foreground | 2,011  | 5.52%  | "game wont even load due to the pop ups before game starts..splash sceen pop ups suck"
Background | 4,171  | 11.44% | "spam emails that I did not send were going out from my email address."
After      | 36     | 0.10%  | "six months after uninstall, *** Spam just keeps on coming"

Table 2: Statistics and sample comment pieces from the suspect set

4. CSPI DETECTION
The purpose of CSPI detection is to detect certain types of CSPI in the suspect set. This is naturally a multi-label classification problem. Independent Logistic Regression (ILR) is used as the baseline. The proposed CDCE method adds two improvements upon it: comment expansion, to embed similarity between comments, and a post-process to exploit label correlation.

4.1 Feature Extraction
To represent comments for detection, a Bag-of-Words (BOW) feature vector is extracted from the comment text. Proper pre-processing is applied before feature extraction, including stop-word removal and stemming. The dimension of the BOW feature is tuned by removing the rarest and the most popular words. With the BOW features and the annotation with the 11 labels, the suspect set can be represented as X and Y: X = {X_1, X_2, ..., X_N} is the set of BOW feature vectors of the comments, with N the number of comments, and Y = {Y_1, Y_2, ..., Y_N} is the set of label vectors, with Y_i = {Y_i^1, ..., Y_i^{11}} and Y_i^j ∈ {−1, 1}, where Y_i^j = 1 indicates the presence of the j-th label for comment i.
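As a rough sketch of this feature-extraction step, assuming scikit-learn's CountVectorizer and NLTK's Porter stemmer as stand-ins (the paper does not name its stemmer, stop-word list, or tooling):

```python
from nltk.stem.porter import PorterStemmer  # assumed stemmer; the paper names none
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, drop stop words, and stem the remaining tokens."""
    tokens = (t for t in text.lower().split() if t not in ENGLISH_STOP_WORDS)
    return " ".join(stemmer.stem(t) for t in tokens)

# min_df/max_df play the role of pruning the rarest and most popular words;
# the paper prunes by absolute comment counts (100 and 1,000,000, Section 5.1).
vectorizer = CountVectorizer(preprocessor=preprocess, min_df=100, max_df=1_000_000)

# training_comments is a hypothetical list of raw comment strings:
# X = vectorizer.fit_transform(training_comments)  # rows are the BOW vectors X_i
```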
4.2 Independent Logistic Regression (ILR)
Much work has been done on multi-label learning from different perspectives. Madjarov et al. [7] made an extensive empirical comparison of multi-label learning methods; in that comparison, Binary Relevance (BR) [11] performs similarly to the classifier chaining (CC) method [9]. Although not the best method among all the competitors, BR shows robust performance and has the advantage of simplicity. For the task of CSPI detection, this paper does not seek to investigate or improve multi-label learning algorithms in general; therefore, ILR is chosen as the baseline, which is simply the BR framework with LR [1] as the base classifier. LR is a widely adopted linear classifier thanks to its robustness and simplicity. To apply LR to the multi-label problem under the assumption of independence between labels, one LR classifier is trained for each label. The objective function of ILR with a 2-norm regularizer is

    \min_{w_j, b_j} \; NLL(X, w_j, b_j) + \lambda \left( \|w_j\|_2^2 + \|b_j\|_2^2 \right)

with

    NLL(X, w_j, b_j) = \sum_{i=1}^{N} \ln\left( 1 + \exp\left( -y_i^j (w_j^T X_i + b_j) \right) \right), \quad j = 1, \ldots, 11,

where NLL(X, w_j, b_j) is the negative log-likelihood function and λ is the regularization parameter, which is obtained by cross-validation on the training set. This optimization problem can be solved using gradient descent.
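A compact sketch of ILR using scikit-learn's logistic regression as the per-label base classifier; here C plays the role of 1/λ and would be chosen by the 5-fold cross-validation described in Section 5 (note that scikit-learn regularizes only w_j, not the intercept, a minor departure from the objective above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ilr(X, Y, C=1.0):
    """Train one L2-regularized LR per label (Binary Relevance with LR).

    X: (N, d) feature matrix; Y: (N, 11) label matrix with entries in {-1, +1}.
    """
    return [
        LogisticRegression(penalty="l2", C=C, solver="lbfgs", max_iter=1000)
        .fit(X, Y[:, j])
        for j in range(Y.shape[1])
    ]

def predict_proba_ilr(models, X):
    """Per-label probability of the positive class, shape (N, 11)."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models])
```

These per-label probabilities are exactly the Ŷ fed to the post-process of Section 4.4.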
4.3 Comment Expansion
User comments have several properties that distinguish them from properly compiled documents. They are normally short, with misspelled words (as mentioned in Section 3.1) and made-up words (e.g., "spamspamspamspam"). Also, different words or phrases may be used for the same opinion. These properties may harm CSPI detection performance, since the features may not fully represent the opinion of a comment. This is similar to the situation of user queries in Information Retrieval (IR), where a query submitted to a search engine can also be short, misspelled, and varied in word choice. Motivated by this, a traditional IR technique called query expansion with pseudo-relevance feedback [12] is borrowed here for comment expansion. Originally, this technique takes the top documents in the retrieval ranking for the original query as "relevant" documents and uses them to expand the original query, generating a new query that resembles the "relevant" documents. Such query expansion is reported to almost always have a positive effect on retrieval performance [12]. A similar process can be adopted for comment expansion, in two steps:

1. Find "relevant" comments via retrieval.
2. Expand the original comment with the "relevant" comments.

The relevance between comments is evaluated using the retrieval model. An interesting question is the scope of the retrieval: should the candidate "relevant" comments be picked from all other comments, or only from comments on the same app or the same category of apps? Besides, the "relevant" comments should only be those posted before the comment being expanded. To avoid issuing complicated queries to the retrieval engine, the query contains only the comment itself, with no constraint; a sufficiently long ranked list is returned by the retrieval engine, and the scope and restrictions are applied to this list afterwards to pick the "relevant" comments.

With a set of "relevant" comments, the comment expansion is conducted by taking a convex combination, at the feature level, of the original comment and the mean of the "relevant" comments:

    f_{new} = (1 - \alpha) f_{old} + \alpha \cdot \frac{1}{|R|} \sum_{f \in R} f

where f_{old} is the BOW feature vector of the original comment, f_{new} the feature vector of the expanded comment, and R the set of "relevant" comments for the comment being expanded. α serves as a tunable parameter for the strength of the expansion and is tuned in the experiments for the best performance, as is the size of R.
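Once the "relevant" set R has been retrieved, the expansion itself is a single convex combination. A minimal sketch, with `retrieve` a hypothetical wrapper around the retrieval engine (Lemur in Section 5) that applies the scope and time restrictions:

```python
import numpy as np

def expand_comment(f_old, relevant_feats, alpha):
    """f_new = (1 - alpha) * f_old + alpha * mean of 'relevant' feature vectors."""
    return (1.0 - alpha) * f_old + alpha * np.mean(relevant_feats, axis=0)

# Usage, with `retrieve` a hypothetical function returning indices of the
# top-ranked in-scope comments posted before the comment being expanded:
#
# R = retrieve(query=comment_text, scope="App", top_k=size_R)
# X_expanded[i] = expand_comment(X[i], X[R], alpha)
```

With alpha = 0 the comment is unchanged; with alpha = 1 it is replaced entirely by the mean of its "relevant" neighbors.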
Comment expansion is an efficient way to utilize the relationship between comments. Other choices, such as kernel methods, may require computing the similarity between every pair of comments, so the amount of computation grows quadratically with the number of comments. Comment expansion, on the other hand, relies on a retrieval engine, and the cost of retrieval for one comment is normally constant, so the total time cost is linear in the number of comments.

4.4 Post-Process with Label Correlation
Label correlation is used in various ways in multi-label learning [7]. As mentioned in Section 3.1, labels hardly ever appear alone, so the correlation between labels can be a valuable source of information. For the CSPI detection task, a simple post-process is applied to embed this information. The post-process runs a second round of ILR with a different feature representation of the comments: the probability outputs Ŷ of the first-round ILR classifiers are used as features in this second round of ILR to predict Y, and the correlation matrix of the labels is incorporated through a Laplacian regularizer [8] in these LR problems. Let Ŷ = {Ŷ_1, ..., Ŷ_N} be the estimated label vectors from the first-round ILR. The objective function of the second-round LR is

    \min_{\hat{w}_j, \hat{b}_j} \; NLL(\hat{Y}, \hat{w}_j, \hat{b}_j) + \hat{\lambda} \left( \hat{w}_j^T L \hat{w}_j + \|\hat{b}_j\|_2^2 \right)

with

    NLL(\hat{Y}, \hat{w}_j, \hat{b}_j) = \sum_{i=1}^{N} \ln\left( 1 + \exp\left( -y_i^j (\hat{w}_j^T \hat{Y}_i + \hat{b}_j) \right) \right), \quad j = 1, \ldots, 11,

where L = D − A is the Laplacian matrix, with A the correlation matrix of the labels Y in the training set and D a diagonal matrix whose diagonal elements are the row sums of A. These optimization problems can be solved similarly using gradient descent.
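A sketch of how the second-round ingredients would be assembled; the correlation estimator here (plain empirical correlation) is an assumption, since the paper does not specify it, and the gradient-descent/L-BFGS solver is omitted:

```python
import numpy as np

def label_laplacian(Y_train):
    """L = D - A, with A the label correlation matrix and D = diag(row sums of A).

    Y_train: (N, 11) label matrix with entries in {-1, +1}.
    """
    A = np.corrcoef(Y_train.T)   # (11, 11) correlation between labels
    D = np.diag(A.sum(axis=1))
    return D - A

def second_round_objective(w, b, Y_hat, y, L, lam):
    """NLL of LR on first-round probabilities Y_hat, plus the Laplacian penalty.

    w: (11,) weights over first-round label scores; b: scalar intercept;
    Y_hat: (N, 11) first-round probabilities; y: (N,) labels in {-1, +1}.
    """
    margins = y * (Y_hat @ w + b)
    nll = np.sum(np.log1p(np.exp(-margins)))
    return nll + lam * (w @ L @ w + b * b)
```

The penalty w^T L w encourages the second-round weights to vary smoothly over labels that are strongly correlated in the training annotations.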
5. EXPERIMENTS
5.1 Experiment Setting
The proposed CSPI detection method is evaluated on the suspect set of comments. As a supervised method, it requires a training set. The suspect set is split 50/50 into a training set and a testing set. The split is made at the app level, so all comments for the same app fall entirely into either the training or the testing set. The BOW feature vector has length 13,135, obtained by removing the words appearing in fewer than 100 or more than 1,000,000 comments in the training set. Considering the hierarchy in the label set, a comment labeled with a sub-label is also treated as a positive sample for its super-label, while a sample labeled only with a super-label is treated as neither a positive nor a negative sample for its sub-labels. For the baseline ILR, 5-fold cross-validation on the training set is adopted for finding the λs, and the L-BFGS quasi-Newton method [3] is applied to solve the optimization problems. For comment expansion, tf-idf features with cosine similarity are adopted as the retrieval model, and Lemur (http://www.lemurproject.org/) is the actual retrieval tool. A time constraint is enforced to prevent expansion with "future" comments. A set constraint allows expansion only within the training and testing sets respectively, and the retrieval indexes are built separately for the two sets so that model parameters such as document counts and IDF values do not interfere between sets. Three retrieval scopes — All, App, and Category — are tested for each label, and 5-fold cross-validation on the training set picks the best of the three for each label. The mixture ratio α and the size |R| of the "relevant" set are also fixed per label by the same 5-fold cross-validation on the training set, with |R| selected from {1, 3, 5, 10}. For the post-process with label correlation, the optimization is solved in the same way as for the baseline ILR.

The evaluation metric is the micro F1 value. F1 is the harmonic mean of precision and recall, and micro F1 is computed across all samples and all labels at once.
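Micro F1 pools true positives, false positives, and false negatives over all samples and labels before computing precision and recall, which is what yields a single number across the 11 labels. A minimal sketch (equivalent to scikit-learn's f1_score with average="micro" on 0/1 indicator matrices):

```python
import numpy as np

def micro_f1(Y_true, Y_pred):
    """Micro-averaged F1 over all samples and labels at once (labels in {-1, +1})."""
    tp = np.sum((Y_true == 1) & (Y_pred == 1))
    fp = np.sum((Y_true != 1) & (Y_pred == 1))
    fn = np.sum((Y_true == 1) & (Y_pred != 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```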
5.2 Results and Analysis
The comparison between the proposed CDCE method and the baseline in F1 value is shown in Table 3.

method | General | Scenario | Nature  | All
ILR    | 0.7962  | 0.6713   | 0.7032  | 0.7153
CDCE−  | 0.8037† | 0.6740†  | 0.7159† | 0.7223†
CDCE+  | 0.8004† | 0.6814†  | 0.7096† | 0.7225†
CDCE∗  | 0.8037† | 0.6836†  | 0.7159† | 0.7263†

Table 3: Experiment results on the suspect set. † marks statistical significance against ILR, computed over ten different random splits of the training/testing sets using a one-tailed pairwise t-test with α = 0.05.

CDCE− denotes the method using only comment expansion without the post-process; CDCE+ denotes the method using comment expansion and the post-process on all labels; CDCE∗ denotes the method that picks between CDCE− and CDCE+ by cross-validation for each label. The evaluation is based on 10 different training/testing splits, and the reported micro F1 values are the means over these ten settings.

Comparing CDCE− with ILR, the overall improvement from comment expansion is clear. The expansion makes comments "smoother" among similar comments at the feature level, improving classifier performance on short comments and on those with misspelled words. For example, "ads" may sometimes be misspelled as "add"; the expansion adds the features of similar comments into the comment's own feature vector, so the feature dimension for "ads" may no longer be zero, letting the classifier capture it and tend to assign the label Spam. On the other hand, the effect of expansion is not always positive for every sample. For example, among comments on a Game app, many may mention "money" because they paid for an item in the game but never received it; these should be labeled Finance. But another comment on the same app may merely say that the item is too expensive and hence not worth the "money" — this one is not a CSPI. After expansion, such a comment looks much like the others and may be labeled Finance as well. A negative effect of comment expansion, therefore, is that it may silence a dissenting voice that makes a different point while using words similar to the others'. Nevertheless, the overall effect of comment expansion is clearly positive.

The difference between CDCE− and CDCE+ shows that the post-process does not guarantee an improvement for every label. Hence the CDCE∗ method is proposed, picking the better model between CDCE− and CDCE+ for each label on the training set. CDCE∗ provides the best performance of all the methods.

A label-level comparison between CDCE∗ and ILR is given in Table 4, where CDCE∗ outperforms ILR on every label except After. The model behavior for After differs from the others mostly due to the lack of samples: with only 36 samples split into training and testing sets, the training of the model is practically insufficient, and the comment expansion mostly draws "relevant" comments carrying different labels. Besides After, three labels — Before, System, and Others — do not pass the statistical significance test at the α level of 0.05, with p-values of 8.2%, 5.5%, and 8.1% respectively. Other than these, the highest p-value is 0.9%, for Finance.

The results of picking the better model between CDCE− and CDCE+ are shown in the P.P. column of Table 4.

Label      | ILR (F1) | CDCE∗ (F1) | P.P. | Type
General    | 0.7962   | 0.8037†    | No   | App
Before     | 0.6965   | 0.7012     | Yes  | Cat.
Execution  | 0.7277   | 0.7358†    | Yes  | App
Foreground | 0.4347   | 0.4689†    | Yes  | Cat.
Background | 0.6674   | 0.6750†    | No   | App
After      | 0.1327   | 0.0882     | No   | App
System     | 0.3991   | 0.4012     | No   | App
Privacy    | 0.7264   | 0.7350†    | No   | Cat.
Spam       | 0.8181   | 0.8304†    | No   | Cat.
Finance    | 0.5238   | 0.5320†    | No   | Cat.
Others     | 0.0235   | 0.0384     | No   | App

Table 4: Detailed comparison at label level. The † is computed at α = 0.05 with a one-tailed t-test across 10 different training/testing splits.

It appears that none of the Nature labels are suitable for the post-process, while the Scenario labels get a boost, consistent with the results in Table 3. The post-process uses the ILR results as features to predict a label. This prediction may be improved by correlation information from other labels, but it also suffers from poorly predicted ILR results. The important words (features) for the Nature labels are not as diverse as those for the Scenario labels. For example, for the label Spam, whichever Scenario label comes with it, the comment will probably still use words like "spam" or "ads"; similarly, for Privacy there are "privacy", "invade", or "permission". To distinguish Scenario labels, however, different words matter depending on the Nature label: given Spam, Foreground relates to oversized on-screen ads and Background most likely to notification-bar spamming; given Privacy, on the other hand, Foreground may relate to phishing and Background to stealing or abuse of personal information. Hence the correlation information may be much more helpful for the Scenario labels than for the Nature ones.

The retrieval scopes chosen for comment expansion for each label are listed in the Type column of Table 4. No classifier chooses All comments as candidates for "relevant" comments; the scopes of the same app (App) and the same category of apps (Cat.) are both popular. The All scope makes the "relevant" comments too diverse and noisy, and the expansion then normally leads to unexpected results. The App and Cat. scopes, on the other hand, serve well, providing a much higher probability of retrieving "relevant" comments with both similar text content and a similar topic of issues.
6. LIMITATION & FUTURE WORK
As a comment-level analysis, one limitation of this work is that it does not provide a risk assessment at the app level. How to evaluate an app's security/privacy risk based on the identified CSPI would be interesting future work. Another limitation comes from the coarse filtering: CSPI that do not contain any of the keywords are left out by the method, whether due to the variety of language itself or simply to misspellings or made-up words used in place of the keywords. Further research may therefore address how to expand the suspect set as a trade-off between computational overhead and detection performance.

7. CONCLUSION
In this paper, a supervised learning method is proposed to detect CSPI for Android apps. The task is formalized as a multi-label learning problem with a two-dimensional label system capturing the "What" and "When" of the issues reported in CSPI. A coarse filter is first applied to narrow down the set of suspect comments. Comment expansion is then adopted to improve the representativeness of the features by taking a convex combination of the original feature vector with those of "relevant" comments. Finally, a post-process is applied to some of the labels to exploit label correlation for further improvement. Experimental results on the collected dataset show a statistically significant improvement overall against the ILR baseline.

8. REFERENCES
[1] C. M. Bishop. Pattern Recognition and Machine Learning, volume 1. Springer, New York, 2006.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[3] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[4] N. Chen, J. Lin, S. C. Hoi, X. Xiao, and B. Zhang. AR-Miner: Mining informative reviews for developers from mobile app marketplace. In International Conference on Software Engineering, 2014.
[5] B. Fu, J. Lin, L. Li, C. Faloutsos, J. Hong, and N. Sadeh. Why people hate your app: Making sense of user feedback in a mobile app store. In Proceedings of the 19th ACM SIGKDD, pages 1276–1284. ACM, 2013.
[6] L. V. Galvis Carreño and K. Winbladh. Analysis of user comments: An approach for software requirements evolution. In Proceedings of the 2013 International Conference on Software Engineering, pages 582–591. IEEE Press, 2013.
[7] G. Madjarov, D. Kocev, D. Gjorgjevikj, and S. Džeroski. An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45(9):3084–3104, 2012.
[8] B. Quanz and J. Huan. Aligned graph classification with regularized logistic regression. In SDM, pages 353–364. SIAM, 2009.
[9] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333–359, 2011.
[10] H. T. T. Truong, E. Lagerspetz, P. Nurmi, A. J. Oliner, S. Tarkoma, N. Asokan, and S. Bhattacharya. The company you keep: Mobile malware infection rates and inexpensive risk indicators. 2013.
[11] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3):1–13, 2007.
[12] J. Xu and W. B. Croft. Query expansion using local and global document analysis. In Proceedings of the 19th ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96, pages 4–11, New York, NY, USA, 1996. ACM.