User Comment Analysis for Android Apps and CSPI Detection with Comment Expansion

Lei Cen, Luo Si, Ninghui Li
Computer Science Department, Purdue University, West Lafayette, IN 47907, USA
{lcen, lsi, lin}@purdue.edu

Hongxia Jin
Samsung Information Systems America, San Jose, CA 95134, USA
hongxia.jin@sisa.samsung.com

ABSTRACT
The exponential growth of mobile app markets has been accompanied by serious public concern about security and privacy issues. User comments serve as a valuable source of information for evaluating a mobile app, for both new users and developers. However, for evaluating the security/privacy aspects of an app, user comments are not always directly useful: most comments concern issues such as functionality or missing features, or are pure emotional expression. Further effort is therefore required to identify the Comments with Security/Privacy Issues (CSPI) for later evaluation. In this paper, a dataset of comments is collected from Google Play, and a two-dimensional label system is proposed to describe the CSPI within it. A supervised multi-label learning method utilizing comment expansion is adopted to detect the different types of CSPI described by this label system. Experiments on the collected dataset show that the proposed method outperforms the same method without comment expansion.

1. INTRODUCTION
New challenges come with the exponentially growing markets of mobile apps. On one side, compared to traditional software markets, markets like Google Play and the Apple App Store have a lower entry threshold for developers and faster financial payback, which greatly encourages developers to invest in this thriving business. One result is a huge number of mobile apps of great diversity. Controlling the quality of apps, and especially their security risk, across whole markets therefore becomes an important issue for everyone involved. On the other side, public concern about privacy issues with online activity and mobile phones is also rising, demanding a mobile environment with more respect for users' privacy.

The infection rate of truly malicious mobile apps in a market can only be estimated; it is reported to be about 0.28% in [10]. The remaining apps are assumed benign, but they are not free of security issues. Many misbehaviors of a mobile app may lead to real security/privacy issues. For example, adding too many or inappropriate advertisements may lead to phishing and scareware; unresolved In-App Purchases (IAP) may lead to fraud; unnecessary background execution may be connected with unauthorized access to personal data.

User comments are valuable feedback for both new users and developers. Many security/privacy issues can be revealed by user comments, including those that may not be easy to discover from other sources (e.g., unresolved IAP issues). However, most negative comments are not necessarily Comments with Security/Privacy Issues (CSPI). Some are non-informative comments [4], consisting of vague statements and pure emotional expressions. Others may complain about the attractiveness, quality, or even cost of the app [5], which normally has nothing to do with security/privacy. To make use of the comments that reveal issues genuinely related to the security/privacy of an app, CSPI must be detected first, setting aside all the irrelevant comments. This paper provides a novel method for this problem. A set of comments is first collected and investigated. Based on observations from these comments, a two-dimensional label system is designed to describe CSPI from two different perspectives. With this label system, a CSPI Detection with Comment Expansion (CDCE) method is proposed, which first uses keyword-based filtering to narrow down the set of comments in question, and then applies a supervised multi-label learning method to identify the different types of CSPI.

The contributions of this paper are as follows:

• Instead of using a flat list of labels for different issues, this paper presents a two-dimensional label system, picturing the "What" and "When" of the occurrence of a reported CSPI and providing an additional dimension for understanding the relationship between different issues.

• A supervised learning method is proposed to solve this multi-label problem. Comment expansion is adopted to utilize the relationship between comments, and a post-process is used for the relationship between labels. Both relationships prove useful in improving CSPI detection performance on the collected dataset.

The rest of the paper proceeds as follows. Section 2 describes related work. Section 3 discusses the collection of the data and the design of the label system. Section 4 presents the method for CSPI detection. Section 5 demonstrates the evaluation of the proposed method. Section 6 lists the limitations of this paper together with future work, and Section 7 concludes the paper.
2. RELATED WORKS
User comments are utilized by several works [5, 6, 4] to evaluate mobile apps. One line of research [6] aims at extracting new or changed requirements for the next version of an app. It proposes the Aspect and Sentiment Unification Model (ASUM) to extract the topics of comments; ASUM extends Latent Dirichlet Allocation (LDA) [2] by incorporating both topic modeling and sentiment analysis. Another work [5] uses an LDA model to identify different topics among negative comments, in order to provide insight into why an app is getting low ratings. Our work focuses on a finer level, investigating only CSPI, which form a subset of the negative comments. Other researchers [4] propose a novel method for extracting informative review topics from user comments: a filtering process is applied first to remove non-informative comments, and LDA is then adopted to generate topics from the informative ones. Our work also requires a filtering process, but one based not on the quality of information in a comment, rather on whether the comment is related to security/privacy. All the works mentioned above rely on topic models (LDA), which are unsupervised. In our setting, the task is not to generate a general summary of the topics in an app's comments, but to identify a specific subset (CSPI) of all the comments and to further distinguish its different types. An unsupervised method like LDA may not produce the expected, precisely specified types of topics. Therefore, a supervised method is adopted in our work, and a label system is manually designed to give the learning process a precise task.

3. DATA COLLECTION
The data used in this work was collected by crawling the Google Play website. Information about 6,938 free apps on Google Play was downloaded from September through December 2013. The total number of comments collected for these 6,938 apps is 5,108,538; the average number of comments per app in this dataset is 736 and the maximum is 4,500. Figure 1a shows the distribution of the number of comments among the apps.

[Figure 1: A simple view of the collected dataset. (a) Histogram of the number of apps against the number of their comments. (b) Histogram of the number of comments with different ratings.]

The ratings attached to user comments are highly skewed, as shown in Figure 1b (this was previously also reported in [5]), indicating that most comments are not complaining about an issue at all. In addition, many of the comments with poor ratings (< 4.0) are not really about security/privacy issues: as observed from the dataset, complaints about functionality and quality (or attractiveness, for Game apps) form the majority. It is therefore necessary to narrow the huge set of comments down to a more feasible size for further analysis and annotation. To that end, a coarse filter is applied first to create a set of suspect comments. This filter is keyword-based: any comment that contains at least one of the predefined keywords is put into the suspect set. The keywords are picked manually in an iterative way. The initial keyword set is just {security, privacy}; new keywords are picked from the words with a high co-occurrence probability with the current keywords, and comments containing the co-occurring words are inspected manually to see how often they relate to security/privacy issues. The final keyword set includes security, privacy, permission, money, spam, steal, phish, etc. To avoid mismatches, different forms of the keywords are also considered, e.g., unsecure and insecure for security, and stole and stolen for steal. A rating filter is also applied to keep only comments with poor ratings (< 4.0). The resulting suspect set contains 36,464 comments from 3,174 apps.
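To make the filtering step concrete, the following is a minimal sketch of the coarse filter, assuming comments are held in memory as (text, rating) pairs; the keyword list shown contains only the keywords and spelling variants named above, not the authors' full set.

```python
import re

# Partial keyword set from the text above, plus the spelling variants it
# mentions; the authors' final list is larger ("security, privacy,
# permission, money, spam, steal, phish, etc.").
KEYWORDS = [
    "security", "unsecure", "insecure", "privacy", "permission",
    "money", "spam", "steal", "stole", "stolen", "phish",
]
KEYWORD_RE = re.compile("|".join(KEYWORDS), re.IGNORECASE)

def build_suspect_set(comments, rating_cutoff=4.0):
    """Keep comments that mention at least one keyword AND have a poor rating.

    `comments` is an iterable of (text, rating) pairs.
    """
    return [
        (text, rating)
        for text, rating in comments
        if rating < rating_cutoff and KEYWORD_RE.search(text)
    ]

# Example:
sample = [("This app steals my data!", 1.0), ("Great game, love it", 5.0)]
print(build_suspect_set(sample))  # [('This app steals my data!', 1.0)]
```

Substring matching deliberately catches inflected forms ("steals", "spamming"), mirroring the paper's inclusion of different keyword forms.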
3.1 CSPI Annotation
It is worth noting that the suspect set still contains much more than CSPI. Because of the simplicity of the keyword-based coarse filter, many comments in it are not CSPI, so the suspect set serves only as a base candidate set of comments suspected to be CSPI. Further effort is needed to actually detect CSPI, and the different types of CSPI need to be distinguished and treated accordingly.

A wide variety of issues appear among the comments in the suspect set. It would be exhausting to distinguish each of them without abstract concepts that account for the relationships between them; at the same time, it would be too vague to merely separate CSPI from non-CSPI comments without further insight into the actual issues. To trade off the complexity and functionality of the annotation system for describing CSPI, a label General is first defined for comments that are CSPI in general. In addition, a two-dimensional (Nature, Scenario) label set is defined, as shown in Table 1, to provide finer-grained identification of different CSPI topics. Among these labels, Execution is a super-label of Foreground and Background, and all labels are sub-labels of General.

The two dimensions of the label set capture the "What" (Nature) and the "When" (Scenario) of the underlying issues. The Nature dimension identifies the nature of the security/privacy-related misbehavior that the comment complains about; such behavior is normally described explicitly in the comment. The Scenario dimension identifies the scenario in which the issue occurs; this is sometimes expressed only implicitly. System, Privacy, and Finance are clearly necessary as direct issues. The label Spam is assigned to the advertising behavior widespread among apps. Although spamming is a popular and common behavior, it is closely related to security issues such as phishing and scareware, and is therefore also included as one type of issue covered by CSPI. The label Others is used for issues not covered by System, Privacy, Spam, or Finance. One example is the topic of apps that cannot be uninstalled normally: such an app is usually pre-installed by the vendor without any real issue, but it clearly violates the user's control over the phone and is complained about by many users. Another example of Others-labeled comments are claims that the app has been flagged by other security apps for some reason. Since these are not direct opinions of the users but only suspicions, they are put under Others rather than under one of the other four natures according to the reason the security app reportedly gave, which may not be true or may not even be stated explicitly in the comment. The label Before is mostly assigned together with Privacy, indicating complaints about the permissions required by the app, which are visible to the user before installation. The label After mostly covers email/SMS spam after uninstallation. The label Execution stands for the most common scenario and is further divided into two sub-labels: Foreground for issues that occur while the app runs in the foreground (occupying the screen) and Background for issues while it runs in the background. If the issue in a comment is not specific about foreground or background, only the label Execution is applied; otherwise both Foreground (or Background) and Execution are applied.

Dimension | Label      | Definition                                                   | Example issues
Nature    | System     | Issues causing negative effects on the system                | Freezing the phone, unauthorized downloading
Nature    | Privacy    | Issues about getting unauthorized access to user info        | Stealing phone number, accounts, emails
Nature    | Spam       | Issues about unpleasant ads and related behavior             | Annoying ads, spam shortcuts on the home screen
Nature    | Finance    | Issues about suspected money stealing                        | Unresolved purchases
Nature    | Others     | Security/privacy issues not included above                   | Uninstall issues
Scenario  | Before     | Issues occurring before installation of the app              | Requiring too many permissions
Scenario  | Execution  | Issues occurring while the app is executing on the phone     | General complaints about spam
Scenario  | Foreground | Issues occurring when the app occupies the foreground screen | Ads occupying too much screen, phishing
Scenario  | Background | Issues occurring while the app runs in the background        | Notification bar spam
Scenario  | After      | Issues occurring after uninstallation of the app             | Email/SMS spam after uninstall

Table 1: Two-dimensional label set
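The super-/sub-label structure above means an annotation can be closed upward before training, mirroring the rule in Section 5.1 that a comment labeled with a sub-label also counts as positive for its super-label. A minimal sketch of that closure, assuming labels are stored as plain strings:

```python
# Super-label implications from the label system: Execution covers
# Foreground and Background, and General covers every CSPI label.
SUPER = {
    "Foreground": "Execution",
    "Background": "Execution",
}

def close_labels(labels):
    """Add implied super-labels (Execution, then General) to a label set."""
    closed = set(labels)
    for label in list(closed):
        if label in SUPER:
            closed.add(SUPER[label])
    if closed:  # every CSPI label implies General
        closed.add("General")
    return closed

print(close_labels({"Spam", "Background"}))
# contains Spam, Background, Execution, General
```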
The suspect set was then annotated manually with these labels; each label was validated by the agreement of two annotators. Statistics of the suspect set with respect to the labels are shown in Table 2. About 30% of the comments in the suspect set are labeled as CSPI, which means that many comments, while mentioning some of the keywords, are not really about security/privacy issues. The sample comment for Whole Set in Table 2 is one of them: it mentions "stolen" not to claim a financial issue but to complain about a stolen idea, which is not about security or privacy. Some comments talk about "money" only to express that the app is not worth it (although all the apps we collected are free, people can still talk about In-App Purchases and the paid version of an app). Another example are the comments on security apps (anti-virus apps, password managers, etc.): many keywords appear in these comments because the apps themselves are about such issues, but the comments are not necessarily reporting an issue of the apps. Similar things happen with banking apps. The label Others has a very low percentage, indicating that the coverage of the four main natures of issues is good.

It is worth noting that misspelling may hinder detection performance; the "baught" in the sample for Finance is an example. Also important is the fact that a CSPI usually carries two or more labels besides General. For example, the sample comment for Spam is also annotated with Background, since it mentions that the spam appears in the notification bar, and the one for Before is also annotated with Privacy.

Label      | #      | %      | Sample comment
Whole Set  | 36,464 | 100%   | "Idea stolen from ***, and it feels unfinished, or unprofessional"
General    | 10,636 | 29.17% | all of the below
System     | 1,304  | 3.58%  | "It keeps crashing the phone. It becomes a device administrator."
Privacy    | 2,513  | 6.89%  | "Leaking GPS location to advertisers=Top 2 fastest ways to get me to hate your app/guts."
Spam       | 6,164  | 16.90% | "Since installing this app I get spam in my notifications constantly."
Finance    | 1,191  | 3.27%  | "I baught $2.00 in *** and I never got them. In really mad about it."
Others     | 191    | 0.52%  | "I buy my own phone with my own money and I cant delete this app from my phone?"
Before     | 1,611  | 4.40%  | "New versions permissions can steal all my data and contacts!!!"
Execution  | 8,456  | 23.19% | "No need for them to invade my privacy."
Foreground | 2,011  | 5.52%  | "game wont even load due to the pop ups before game starts..splash sceen pop ups suck"
Background | 4,171  | 11.44% | "spam emails that I did not send were going out from my email address."
After      | 36     | 0.10%  | "six months after uninstall, *** Spam just keeps on coming"

Table 2: Statistics and sample comment pieces from the suspect set

4. CSPI DETECTION
The purpose of CSPI detection is to detect certain types of CSPI in the suspect set. This is naturally a multi-label classification problem. Independent Logistic Regression (ILR) is used as the baseline. The proposed CDCE method adds two improvements upon it: comment expansion, to embed similarity between comments, and a post-process to exploit label correlation.

4.1 Feature Extraction
To represent comments for detection, a Bag-of-Words (BOW) feature vector is extracted from the comment text. Proper pre-processing is applied before feature extraction, including stop-word removal and stemming. The dimension of the BOW feature is tuned by removing the rarest and the most popular words. With the BOW features and the annotation with the 11 labels, the suspect set can be represented as X and Y: X = {X_1, X_2, ..., X_N} is the set of BOW feature vectors of the comments, with N the number of comments, and Y = {Y_1, Y_2, ..., Y_N} is the set of label vectors, with Y_i = {Y_i^1, ..., Y_i^{11}} and Y_i^j ∈ {−1, 1}, where Y_i^j = 1 indicates the presence of the j-th label for comment i.
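As a rough sketch of this feature-extraction step, assuming scikit-learn's CountVectorizer and NLTK's Porter stemmer as stand-ins (the paper does not name its stemmer, stop-word list, or tooling):

```python
from nltk.stem.porter import PorterStemmer  # assumed stemmer; the paper names none
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, drop stop words, and stem the remaining tokens."""
    tokens = (t for t in text.lower().split() if t not in ENGLISH_STOP_WORDS)
    return " ".join(stemmer.stem(t) for t in tokens)

# min_df/max_df play the role of pruning the rarest and most popular words;
# the paper prunes by absolute comment counts (100 and 1,000,000, Section 5.1).
vectorizer = CountVectorizer(preprocessor=preprocess, min_df=100, max_df=1_000_000)

# training_comments is a hypothetical list of raw comment strings:
# X = vectorizer.fit_transform(training_comments)  # rows are the BOW vectors X_i
```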
4.2 Independent Logistic Regression (ILR)
Much work has been done on multi-label learning from different perspectives. Madjarov et al. [7] made an extensive empirical comparison of multi-label learning methods; in that comparison, Binary Relevance (BR) [11] performs similarly to the classifier chaining (CC) method [9]. Although not the best method among all the competitors, BR shows robust performance and has the advantage of simplicity. For the task of CSPI detection, this paper does not seek to investigate or improve multi-label learning algorithms in general; therefore, ILR is chosen as the baseline, which is simply the BR framework with LR [1] as the base classifier. LR is a widely adopted linear classifier thanks to its robustness and simplicity. To apply LR to the multi-label problem under the assumption of independence between labels, one LR classifier is trained for each label. The objective function of ILR with a 2-norm regularizer is

    \min_{w_j, b_j} \; NLL(X, w_j, b_j) + \lambda \left( \|w_j\|_2^2 + \|b_j\|_2^2 \right)

with

    NLL(X, w_j, b_j) = \sum_{i=1}^{N} \ln\left( 1 + \exp\left( -y_i^j (w_j^T X_i + b_j) \right) \right), \quad j = 1, \ldots, 11,

where NLL(X, w_j, b_j) is the negative log-likelihood function and λ is the regularization parameter, which is obtained by cross-validation on the training set. This optimization problem can be solved using gradient descent.
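A compact sketch of ILR using scikit-learn's logistic regression as the per-label base classifier; here C plays the role of 1/λ and would be chosen by the 5-fold cross-validation described in Section 5 (note that scikit-learn regularizes only w_j, not the intercept, a minor departure from the objective above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ilr(X, Y, C=1.0):
    """Train one L2-regularized LR per label (Binary Relevance with LR).

    X: (N, d) feature matrix; Y: (N, 11) label matrix with entries in {-1, +1}.
    """
    return [
        LogisticRegression(penalty="l2", C=C, solver="lbfgs", max_iter=1000)
        .fit(X, Y[:, j])
        for j in range(Y.shape[1])
    ]

def predict_proba_ilr(models, X):
    """Per-label probability of the positive class, shape (N, 11)."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models])
```

These per-label probabilities are exactly the Ŷ fed to the post-process of Section 4.4.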
4.3 Comment Expansion
User comments have several properties that distinguish them from properly compiled documents. They are normally short, with misspelled words (as mentioned in Section 3.1) and made-up words (e.g., "spamspamspamspam"). Also, different words or phrases may be used for the same opinion. These properties may harm CSPI detection performance, since the features may not fully represent the opinion of a comment. This is similar to the situation of user queries in Information Retrieval (IR), where a query submitted to a search engine can also be short, misspelled, and varied in word choice. Motivated by this, a traditional IR technique called query expansion with pseudo-relevance feedback [12] is borrowed here for comment expansion. Originally, this technique takes the top documents in the retrieval ranking for the original query as "relevant" documents and uses them to expand the original query, generating a new query that resembles the "relevant" documents. Such query expansion is reported to almost always have a positive effect on retrieval performance [12]. A similar process can be adopted for comment expansion, in two steps:

1. Find "relevant" comments via retrieval.
2. Expand the original comment with the "relevant" comments.

The relevance between comments is evaluated using the retrieval model. An interesting question is the scope of the retrieval: should the candidate "relevant" comments be picked from all other comments, or only from comments on the same app or the same category of apps? Besides, the "relevant" comments should only be those posted before the comment being expanded. To avoid issuing complicated queries to the retrieval engine, the query contains only the comment itself, with no constraint; a sufficiently long ranked list is returned by the retrieval engine, and the scope and restrictions are applied to this list afterwards to pick the "relevant" comments.

With a set of "relevant" comments, the comment expansion is conducted by taking a convex combination, at the feature level, of the original comment and the mean of the "relevant" comments:

    f_{new} = (1 - \alpha) f_{old} + \alpha \cdot \frac{1}{|R|} \sum_{f \in R} f

where f_{old} is the BOW feature vector of the original comment, f_{new} the feature vector of the expanded comment, and R the set of "relevant" comments for the comment being expanded. α serves as a tunable parameter for the strength of the expansion and is tuned in the experiments for the best performance, as is the size of R.
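Once the "relevant" set R has been retrieved, the expansion itself is a single convex combination. A minimal sketch, with `retrieve` a hypothetical wrapper around the retrieval engine (Lemur in Section 5) that applies the scope and time restrictions:

```python
import numpy as np

def expand_comment(f_old, relevant_feats, alpha):
    """f_new = (1 - alpha) * f_old + alpha * mean of 'relevant' feature vectors."""
    return (1.0 - alpha) * f_old + alpha * np.mean(relevant_feats, axis=0)

# Usage, with `retrieve` a hypothetical function returning indices of the
# top-ranked in-scope comments posted before the comment being expanded:
#
# R = retrieve(query=comment_text, scope="App", top_k=size_R)
# X_expanded[i] = expand_comment(X[i], X[R], alpha)
```

With alpha = 0 the comment is unchanged; with alpha = 1 it is replaced entirely by the mean of its "relevant" neighbors.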
Comment expansion is an efficient way to utilize the relationship between comments. Other choices, such as kernel methods, may require computing the similarity between every pair of comments, so the amount of computation grows quadratically with the number of comments. Comment expansion, on the other hand, relies on a retrieval engine, and the cost of retrieval for one comment is normally constant, so the total time cost is linear in the number of comments.

4.4 Post-Process with Label Correlation
Label correlation is used in various ways in multi-label learning [7]. As mentioned in Section 3.1, labels hardly ever appear alone, so the correlation between labels can be a valuable source of information. For the CSPI detection task, a simple post-process is applied to embed this information. The post-process runs a second round of ILR with a different feature representation of the comments: the probability outputs Ŷ of the first-round ILR classifiers are used as features in this second round of ILR to predict Y, and the correlation matrix of the labels is incorporated through a Laplacian regularizer [8] in these LR problems. Let Ŷ = {Ŷ_1, ..., Ŷ_N} be the estimated label vectors from the first-round ILR. The objective function of the second-round LR is

    \min_{\hat{w}_j, \hat{b}_j} \; NLL(\hat{Y}, \hat{w}_j, \hat{b}_j) + \hat{\lambda} \left( \hat{w}_j^T L \hat{w}_j + \|\hat{b}_j\|_2^2 \right)

with

    NLL(\hat{Y}, \hat{w}_j, \hat{b}_j) = \sum_{i=1}^{N} \ln\left( 1 + \exp\left( -y_i^j (\hat{w}_j^T \hat{Y}_i + \hat{b}_j) \right) \right), \quad j = 1, \ldots, 11,

where L = D − A is the Laplacian matrix, with A the correlation matrix of the labels Y in the training set and D a diagonal matrix whose diagonal elements are the row sums of A. These optimization problems can be solved similarly using gradient descent.
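A sketch of how the second-round ingredients would be assembled; the correlation estimator here (plain empirical correlation) is an assumption, since the paper does not specify it, and the gradient-descent/L-BFGS solver is omitted:

```python
import numpy as np

def label_laplacian(Y_train):
    """L = D - A, with A the label correlation matrix and D = diag(row sums of A).

    Y_train: (N, 11) label matrix with entries in {-1, +1}.
    """
    A = np.corrcoef(Y_train.T)   # (11, 11) correlation between labels
    D = np.diag(A.sum(axis=1))
    return D - A

def second_round_objective(w, b, Y_hat, y, L, lam):
    """NLL of LR on first-round probabilities Y_hat, plus the Laplacian penalty.

    w: (11,) weights over first-round label scores; b: scalar intercept;
    Y_hat: (N, 11) first-round probabilities; y: (N,) labels in {-1, +1}.
    """
    margins = y * (Y_hat @ w + b)
    nll = np.sum(np.log1p(np.exp(-margins)))
    return nll + lam * (w @ L @ w + b * b)
```

The penalty w^T L w encourages the second-round weights to vary smoothly over labels that are strongly correlated in the training annotations.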
5. EXPERIMENTS
5.1 Experiment Setting
The proposed CSPI detection method is evaluated on the suspect set of comments. As a supervised method, it requires a training set. The suspect set is split 50/50 into a training set and a testing set. The split is made at the app level, so all comments for the same app fall entirely into either the training or the testing set. The BOW feature vector has length 13,135, obtained by removing the words appearing in fewer than 100 or more than 1,000,000 comments in the training set. Considering the hierarchy in the label set, a comment labeled with a sub-label is also treated as a positive sample for its super-label, while a sample labeled only with a super-label is treated as neither a positive nor a negative sample for its sub-labels. For the baseline ILR, 5-fold cross-validation on the training set is adopted for finding the λs, and the L-BFGS quasi-Newton method [3] is applied to solve the optimization problems. For comment expansion, tf-idf features with cosine similarity are adopted as the retrieval model, and Lemur (http://www.lemurproject.org/) is the actual retrieval tool. A time constraint is enforced to prevent expansion with "future" comments. A set constraint allows expansion only within the training and testing sets respectively, and the retrieval indexes are built separately for the two sets so that model parameters such as document counts and IDF values do not interfere between sets. Three retrieval scopes — All, App, and Category — are tested for each label, and 5-fold cross-validation on the training set picks the best of the three for each label. The mixture ratio α and the size |R| of the "relevant" set are also fixed per label by the same 5-fold cross-validation on the training set, with |R| selected from {1, 3, 5, 10}. For the post-process with label correlation, the optimization is solved in the same way as for the baseline ILR.

The evaluation metric is the micro F1 value. F1 is the harmonic mean of precision and recall, and micro F1 is computed across all samples and all labels at once.
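Micro F1 pools true positives, false positives, and false negatives over all samples and labels before computing precision and recall, which is what yields a single number across the 11 labels. A minimal sketch (equivalent to scikit-learn's f1_score with average="micro" on 0/1 indicator matrices):

```python
import numpy as np

def micro_f1(Y_true, Y_pred):
    """Micro-averaged F1 over all samples and labels at once (labels in {-1, +1})."""
    tp = np.sum((Y_true == 1) & (Y_pred == 1))
    fp = np.sum((Y_true != 1) & (Y_pred == 1))
    fn = np.sum((Y_true == 1) & (Y_pred != 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```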
5.2 Results and Analysis
The comparison between the proposed CDCE method and the baseline in F1 value is shown in Table 3.

method | General | Scenario | Nature  | All
ILR    | 0.7962  | 0.6713   | 0.7032  | 0.7153
CDCE−  | 0.8037† | 0.6740†  | 0.7159† | 0.7223†
CDCE+  | 0.8004† | 0.6814†  | 0.7096† | 0.7225†
CDCE∗  | 0.8037† | 0.6836†  | 0.7159† | 0.7263†

Table 3: Experiment results on the suspect set. † marks statistical significance against ILR, computed over ten different random splits of the training/testing sets using a one-tailed pairwise t-test with α = 0.05.

CDCE− denotes the method using only comment expansion without the post-process; CDCE+ denotes the method using comment expansion and the post-process on all labels; CDCE∗ denotes the method that picks between CDCE− and CDCE+ by cross-validation for each label. The evaluation is based on 10 different training/testing splits, and the reported micro F1 values are the means over these ten settings.

Comparing CDCE− with ILR, the overall improvement from comment expansion is clear. The expansion makes comments "smoother" among similar comments at the feature level, improving classifier performance on short comments and on those with misspelled words. For example, "ads" may sometimes be misspelled as "add"; the expansion adds the features of similar comments into the comment's own feature vector, so the feature dimension for "ads" may no longer be zero, letting the classifier capture it and tend to assign the label Spam. On the other hand, the effect of expansion is not always positive for every sample. For example, among comments on a Game app, many may mention "money" because they paid for an item in the game but never received it; these should be labeled Finance. But another comment on the same app may merely say that the item is too expensive and hence not worth the "money" — this one is not a CSPI. After expansion, such a comment looks much like the others and may be labeled Finance as well. A negative effect of comment expansion, therefore, is that it may silence a dissenting voice that makes a different point while using words similar to the others'. Nevertheless, the overall effect of comment expansion is clearly positive.

The difference between CDCE− and CDCE+ shows that the post-process does not guarantee an improvement for every label. Hence the CDCE∗ method is proposed, picking the better model between CDCE− and CDCE+ for each label on the training set. CDCE∗ provides the best performance of all the methods.

A label-level comparison between CDCE∗ and ILR is given in Table 4, where CDCE∗ outperforms ILR on every label except After. The model behavior for After differs from the others mostly due to the lack of samples: with only 36 samples split into training and testing sets, the training of the model is practically insufficient, and the comment expansion mostly draws "relevant" comments carrying different labels. Besides After, three labels — Before, System, and Others — do not pass the statistical significance test at the α level of 0.05, with p-values of 8.2%, 5.5%, and 8.1% respectively. Other than these, the highest p-value is 0.9%, for Finance.

The results of picking the better model between CDCE− and CDCE+ are shown in the P.P. column of Table 4.

Label      | ILR (F1) | CDCE∗ (F1) | P.P. | Type
General    | 0.7962   | 0.8037†    | No   | App
Before     | 0.6965   | 0.7012     | Yes  | Cat.
Execution  | 0.7277   | 0.7358†    | Yes  | App
Foreground | 0.4347   | 0.4689†    | Yes  | Cat.
Background | 0.6674   | 0.6750†    | No   | App
After      | 0.1327   | 0.0882     | No   | App
System     | 0.3991   | 0.4012     | No   | App
Privacy    | 0.7264   | 0.7350†    | No   | Cat.
Spam       | 0.8181   | 0.8304†    | No   | Cat.
Finance    | 0.5238   | 0.5320†    | No   | Cat.
Others     | 0.0235   | 0.0384     | No   | App

Table 4: Detailed comparison at label level. The † is computed at α = 0.05 with a one-tailed t-test across 10 different training/testing splits.

It appears that none of the Nature labels are suitable for the post-process, while the Scenario labels get a boost, consistent with the results in Table 3. The post-process uses the ILR results as features to predict a label. This prediction may be improved by correlation information from other labels, but it also suffers from poorly predicted ILR results. The important words (features) for the Nature labels are not as diverse as those for the Scenario labels. For example, for the label Spam, whichever Scenario label comes with it, the comment will probably still use words like "spam" or "ads"; similarly, for Privacy there are "privacy", "invade", or "permission". To distinguish Scenario labels, however, different words matter depending on the Nature label: given Spam, Foreground relates to oversized on-screen ads and Background most likely to notification-bar spamming; given Privacy, on the other hand, Foreground may relate to phishing and Background to stealing or abuse of personal information. Hence the correlation information may be much more helpful for the Scenario labels than for the Nature ones.

The retrieval scopes chosen for comment expansion for each label are listed in the Type column of Table 4. No classifier chooses All comments as candidates for "relevant" comments; the scopes of the same app (App) and the same category of apps (Cat.) are both popular. The All scope makes the "relevant" comments too diverse and noisy, and the expansion then normally leads to unexpected results. The App and Cat. scopes, on the other hand, serve well, providing a much higher probability of retrieving "relevant" comments with both similar text content and a similar topic of issues.
6. LIMITATION & FUTURE WORK
As a comment-level analysis, one limitation of this work is that it does not provide a risk assessment at the app level. How to evaluate an app's security/privacy risk based on the identified CSPI would be interesting future work. Another limitation comes from the coarse filtering: CSPI that do not contain any of the keywords are left out by the method, whether due to the variety of language itself or simply to misspellings or made-up words used in place of the keywords. Further research may therefore address how to expand the suspect set as a trade-off between computational overhead and detection performance.

7. CONCLUSION
In this paper, a supervised learning method is proposed to detect CSPI for Android apps. The task is formalized as a multi-label learning problem with a two-dimensional label system capturing the "What" and "When" of the issues reported in CSPI. A coarse filter is first applied to narrow down the set of suspect comments. Comment expansion is then adopted to improve the representativeness of the features by taking a convex combination of the original feature vector with those of "relevant" comments. Finally, a post-process is applied to some of the labels to exploit label correlation for further improvement. Experimental results on the collected dataset show a statistically significant improvement overall against the ILR baseline.

8. REFERENCES
[1] C. M. Bishop. Pattern Recognition and Machine Learning, volume 1. Springer, New York, 2006.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[3] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[4] N. Chen, J. Lin, S. C. Hoi, X. Xiao, and B. Zhang. AR-Miner: Mining informative reviews for developers from mobile app marketplace. In International Conference on Software Engineering, 2014.
[5] B. Fu, J. Lin, L. Li, C. Faloutsos, J. Hong, and N. Sadeh. Why people hate your app: Making sense of user feedback in a mobile app store. In Proceedings of the 19th ACM SIGKDD, pages 1276–1284. ACM, 2013.
[6] L. V. Galvis Carreño and K. Winbladh. Analysis of user comments: An approach for software requirements evolution. In Proceedings of the 2013 International Conference on Software Engineering, pages 582–591. IEEE Press, 2013.
[7] G. Madjarov, D. Kocev, D. Gjorgjevikj, and S. Džeroski. An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45(9):3084–3104, 2012.
[8] B. Quanz and J. Huan. Aligned graph classification with regularized logistic regression. In SDM, pages 353–364. SIAM, 2009.
[9] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333–359, 2011.
[10] H. T. T. Truong, E. Lagerspetz, P. Nurmi, A. J. Oliner, S. Tarkoma, N. Asokan, and S. Bhattacharya. The company you keep: Mobile malware infection rates and inexpensive risk indicators. 2013.
[11] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3):1–13, 2007.
[12] J. Xu and W. B. Croft. Query expansion using local and global document analysis. In Proceedings of the 19th ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96, pages 4–11, New York, NY, USA, 1996. ACM.