<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>User Comment Analysis for Android apps and CSPI Detection with Comment Expansion</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lei Cen, Luo Si, Ninghui Li</string-name>
          <email>{lcen, lsi, lin}@purdue.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hongxia Jin</string-name>
          <email>hongxia.jin@sisa.samsung.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Purdue University</institution>
          ,
          <addr-line>West Lafayette, IN 47907</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Samsung Information Systems America</institution>
          ,
          <addr-line>San Jose, CA 95134</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Along with the exponential growth of mobile app markets comes serious public concern about security and privacy issues. User comments serve as a valuable source of information for evaluating a mobile app, for both new users and developers. However, for the purpose of evaluating the security/privacy aspects of an app, user comments are not always directly useful. Most comments are about issues such as functionality or missing features, or are purely emotional expressions. Therefore, further effort is required to identify the Comments with Security/Privacy Issues (CSPI) for future evaluation. In this paper, a dataset of comments is collected from Google Play, and a two-dimensional label system is proposed to describe the CSPI within it. A supervised multi-label learning method utilizing comment expansion is adopted to detect the different types of CSPI described by this label system. Experiments on the collected dataset show that the proposed method outperforms the method without comment expansion.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The infection rate of truly malicious mobile apps in a
market can only be estimated; it is reported to be about 0.28%
in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The remaining apps are assumed to be benign, but they are
not free of security issues. Many misbehaviors of a mobile
app may lead to real security/privacy issues. For example,
excessive or inappropriate advertising may lead to
phishing and scareware; unresolved In-App-Purchase (IAP)
problems may lead to fraud; unnecessary running in the background may
be connected with unauthorized access to personal data.
User comments are valuable feedback for both new users and
developers. Many security/privacy issues can be revealed by
user comments, including issues that are not easy
to discover from other sources (e.g. unresolved IAP
issues). However, most negative comments are not
Comments with Security/Privacy Issues (CSPI).
Some of them are non-informative comments [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], including
vague statements and purely emotional expressions. Others
complain about the attractiveness, quality or even
cost of the app [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which normally have nothing to do with
security/privacy. To use the comments to
reveal issues that are really related to the security/privacy
of an app, CSPI need to be detected first so that the
irrelevant comments can be set aside. This paper provides a novel method for
this problem. A set of comments is first collected
and investigated. Based on observations from the comments,
a two-dimensional label system is designed to describe CSPI
from two different perspectives. With this label system, a
CSPI Detection with Comment Expansion (CDCE) method
is proposed that first uses keyword-based filtering to
narrow down the set of comments under consideration and then applies
a supervised multi-label learning method to identify the different
types of CSPI.
      </p>
      <p>The contributions of this paper are as follows.
First, instead of using a flat list of labels for different issues, this
paper presents a two-dimensional label system that
captures the "What" and "When" of the occurrence of a
reported CSPI, providing an additional dimension for
better understanding the relationship between different
issues.</p>
      <p>Second, a supervised learning method is proposed to solve this
multi-label problem. Comment expansion is adopted
to utilize the relationship between comments, and a
post-process is adopted to utilize the relationship between labels. Both
relationships are shown to be useful in improving
CSPI detection performance on the collected
dataset.</p>
      <p>
        The rest of the paper proceeds as follows. Section 2
describes related work. Section 3 discusses the collection of
data and the design of the label system. Section 4 presents
the method for CSPI detection. Section 5 describes the
evaluation of the proposed method. Section 6 lists
the limitations of this paper together with future work, and Section 7
concludes the paper.
      </p>
      <p>2. RELATED WORK</p>
      <p>
        User comments are used by several works [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ] to
evaluate mobile apps. One work [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] aims
to extract new or changed requirements for a new version of an
app. It proposes using the Aspect and Sentiment Unification Model
(ASUM) to extract the topics of comments. ASUM is an
extension of Latent Dirichlet Allocation (LDA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] that
incorporates both topic modeling and sentiment analysis.
Another work [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] uses an LDA model to identify different topics
in negative comments, in order to provide insight
into why an app is getting low ratings. Our work focuses on
a finer level and investigates only CSPI, which are a subset of the
negative comments. In addition, some researchers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] propose a
novel method for extracting informative review topics from
user comments. A filtering process is first applied to
filter out non-informative comments, and then LDA is adopted to
generate topics from the informative ones. Our work also
requires a filtering process, but the filter is based not on
the quality of information but on whether the comment is
related to security/privacy. All the works mentioned above
make use of topic models (LDA), which are unsupervised
methods. In our work, the task is not to generate a general
summary of the topics in the comments of an app, but to
identify a specific part (CSPI) of all the comments and
further distinguish its different types. An unsupervised
method such as LDA may not be able to generate the expected,
specified types of topics. Therefore, a supervised method
is adopted in our work, and a label system is manually
designed to give the learning process a precise task.
      </p>
    </sec>
    <sec id="sec-2">
      <title>3. DATA COLLECTION</title>
      <p>The data used in this work was collected by crawling the
Google Play website. Information about 6,938 free apps
on Google Play was downloaded from September through
December 2013. The total number of user
comments collected for these 6,938 apps is 5,108,538. The
average number of comments per app in this dataset is
736 and the maximum is 4,500. Figure 1a shows the
distribution of the number of comments among the apps.</p>
      <p>[Figure 1: (a) distribution of the number of comments among apps, x-axis "Number of Comments"; (b) distribution of user ratings.]</p>
      <p>
        The ratings from user comments are highly skewed, as shown
in Figure 1b (this was also reported in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]),
indicating that most comments are not complaining about an
issue. In addition, many of the comments with poor ratings
(&lt; 4.0) are not really about security/privacy issues. As
observed in the dataset, complaints about functionality
and quality (or attractiveness for Game apps) make up the
majority. It is therefore necessary to narrow the
huge set of comments down to a more feasible size for further
analysis and annotation. To do so, a coarse filtering is
first applied to create a set of suspect comments. This
filtering is keyword-based: any comment that contains at least
one of the predefined keywords is put into the suspect
set. The keywords are picked manually in an iterative way.
The initial set of keywords is just {security, privacy}. New
keywords are picked from words that have a high co-occurrence
probability with the current keywords. Comments with the
co-occurrence are inspected manually to see how often they
are related to security/privacy issues. The final keyword set
includes security, privacy, permission, money, spam, steal,
phish, etc. To avoid mismatches, different forms of the
keywords are also considered, e.g. unsecure and insecure for
security, and stole and stolen for steal. A rating filter
is also applied to include only comments with poor
ratings (&lt; 4.0). The resulting suspect set contains 36,464
comments from 3,174 apps.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3.1 CSPI Annotation</title>
      <p>It is worth noting that the suspect set does not contain only
CSPI. Many of its comments are not CSPI, owing to the simplicity of
the keyword-based coarse filtering. Hence the suspect set
only serves as a candidate set of comments that are
suspected to be CSPI. Further effort is needed to
actually detect CSPI, and different types of CSPI need to be
distinguished and treated accordingly.</p>
      <p>Various issues are found in the comments of the
suspect set. It would be exhausting to distinguish each of
them without abstract concepts that capture the
relationships between them. At the same time, it would be too vague to just
distinguish CSPI from non-CSPI comments without further
insight into the actual issues. As a trade-off between
the complexity and the functionality of the annotation system
for describing CSPI, a label General is first defined for
comments that are CSPI in general. In addition, a two-dimensional
(Nature, Scenario) label set is defined, as shown
in Table 1, to provide finer-level identification of different
CSPI topics. Among these labels, Execution is a super-label
of Foreground and Background, and all labels are sub-labels
of General.</p>
      <p>
        The two dimensions of the label set serve as the "What" (Nature)
and "When" (Scenario) of the underlying issues. The
dimension Nature identifies the nature of the
security/privacy-related misbehavior of the app that the
comment complains about. These behaviors are normally described
explicitly in the comments. The dimension
Scenario, on the other hand, identifies the scenario in which the issue
occurs. This is sometimes expressed only implicitly in the comments.
System, Privacy and Finance are clearly necessary as the
direct issues. The label Spam is assigned to advertising
behavior, which is widely used among apps. Although it is a popular and
common behavior, spamming is closely related to some
security issues such as phishing and scareware, and hence is also
included as one type of issue mentioned in CSPI. The label
Others is used for issues not covered by System,
Privacy, Spam or Finance. One example is the topic of
apps that cannot be uninstalled normally: it is usually an
app pre-installed by the vendor without any real issues, but
it clearly violates the users' control over the phone and is
complained about by many users. Another example of comments
labeled Others is those claiming that the app
is reported by other security apps for some reason. Since
these are not direct opinions from the users but only
suspicions, they are put under Others instead of under one of the other
four labels according to the reason reported by the
other security app, which may not be true or even
explicitly stated in the comment. The label Before is mostly
assigned together with Privacy to indicate complaints
about the permissions required by the app, which are available
to the user before installation. The label After is
mostly for email/SMS spam after uninstallation. The label
Execution stands for the most common scenario, and it is
further divided into two sub-labels: Foreground for
issues that occur when the app is running in the foreground (occupying
the screen) and Background for issues that occur when it is running in the background. If
the issue in the comment is not specific about foreground
or background, only the label Execution is applied;
otherwise both Foreground (or Background) and Execution are
applied.
      </p>
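      <p>The label hierarchy described above can be sketched in a few lines. This is only an illustration under stated assumptions: the names PARENT and expand_labels are hypothetical and do not come from the paper, and the rule that a sub-label implies its super-labels follows the description of Execution and General in the text.</p>
      <preformat preformat-type="code">
```python
# Hypothetical encoding of the two-dimensional label set and its hierarchy.
NATURE = ["System", "Privacy", "Spam", "Finance", "Others"]
SCENARIO = ["Before", "Execution", "Foreground", "Background", "After"]

# Each label maps to its direct super-label; General is the root.
PARENT = {label: "General" for label in NATURE + SCENARIO}
PARENT["Foreground"] = "Execution"
PARENT["Background"] = "Execution"


def expand_labels(labels):
    """Add every super-label implied by the given labels."""
    expanded = set()
    for label in labels:
        while label is not None:
            expanded.add(label)
            label = PARENT.get(label)  # None once General has been reached
    return expanded


# A spam complaint about the notification bar: Nature=Spam, Scenario=Background.
annotation = expand_labels({"Spam", "Background"})
```
      </preformat>
      <p>In this sketch the annotation of the example comment ends up with four labels (Spam, Background, Execution and General), and the label set has 11 labels in total once General is counted.</p>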
      <p>The suspect set is then annotated manually with these
labels. Each label is validated by the agreement of two people.
Some statistics of the suspect set with respect to the labels
are shown in Table 2. About 30% of the comments in the
suspect set are labeled as CSPI. This also means that many of the
comments, while mentioning some keywords, are not really
talking about security/privacy issues. The sample comment
for Whole Set is one of them: it mentions "stolen" not to
claim a financial issue but to complain about a stolen idea, which
is not about security or privacy. Some comments talk about
"money", but only to express that the app is not worth it
(although all the apps we collected are free, people can
still talk about In-App-Purchases and the paid version of the
app). Another example is the comments for apps that are themselves
about security (anti-virus apps, password managers, etc.).
Many keywords are mentioned in these comments since the
apps are exactly about these issues, but the comments are not
necessarily reporting an issue with the apps. Something similar
happens with bank apps. The label Others has a very low
percentage, indicating that the four main natures of
issues provide good coverage.</p>
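      <p>The coarse filter that produces the suspect set, and the reason it admits non-CSPI comments, can be illustrated with a minimal sketch. It is not the authors' implementation: KEYWORD_FORMS lists only the keyword forms named in the text, the substring matching rule and the helper names matched_keywords and is_suspect are assumptions, and the sample comments are made up.</p>
      <preformat preformat-type="code">
```python
# Hypothetical sketch of the keyword-based coarse filter with a rating cut-off.
# Only the keyword forms named in the text are listed; the real list is longer.
KEYWORD_FORMS = {
    "security": ["security", "unsecure", "insecure"],
    "privacy": ["privacy"],
    "permission": ["permission"],
    "money": ["money"],
    "spam": ["spam"],
    "steal": ["steal", "stole", "stolen"],
    "phish": ["phish"],
}
RATING_THRESHOLD = 4.0  # comments rated below this value count as "poor"


def matched_keywords(text):
    """Return the sorted base keywords whose forms occur in the comment text."""
    lowered = text.lower()
    return sorted(
        base
        for base, forms in KEYWORD_FORMS.items()
        if any(form in lowered for form in forms)
    )


def is_suspect(text, rating):
    """A comment is suspect if it is poorly rated and mentions any keyword."""
    poorly_rated = RATING_THRESHOLD > rating
    return poorly_rated and bool(matched_keywords(text))


comments = [
    ("This app stole my idea, shame on you", 1.0),    # keyword hit, not a CSPI
    ("Great game, love it", 5.0),                     # no keyword, good rating
    ("Too many permissions, why read my SMS?", 2.0),  # keyword hit
    ("Worth the money!", 5.0),                        # keyword hit, good rating
]
suspect_set = [c for c in comments if is_suspect(*c)]
```
      </preformat>
      <p>The first sample passes the filter even though it is about a stolen idea rather than a financial issue, which is why the suspect set still needs annotation and a learned detector.</p>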
      <p>It is worth noting that misspellings may hinder detection
performance. The word "baught" in the sample for Finance
serves as an example. Also important is the fact that a
CSPI usually has two or more labels besides General. For
example, the sample comment for Spam is also annotated
with Background, since it mentions that the spam appears
in the notification bar, and the one for Before is also
annotated with Privacy.</p>
    </sec>
    <sec id="sec-4">
      <title>4. CSPI DETECTION</title>
      <p>The purpose of CSPI detection is to detect certain types of
CSPI in the suspect set. This is naturally a multi-label
classification problem. Independent Logistic Regression (ILR)
is used as a baseline. The proposed CDCE method adds
two improvements to it: Comment Expansion, which embeds the
similarity between comments, and Post-Process with Label
Correlation, which utilizes the correlation between labels.</p>
    </sec>
    <sec id="sec-5">
      <title>4.1 Feature Extraction</title>
      <p>To represent comments for detection, Bag Of Words (BOW)
features are extracted from the comment text. The text is
pre-processed before feature extraction, including
stop-word removal and stemming. The dimension of the
BOW feature is tuned by removing the rarest and the most
popular words. With the BOW features and the annotation with
the 11 labels, the suspect set can be represented as $X$ and
$Y$. $X = \{X_1, X_2, \ldots, X_N\}$ is the set of BOW feature
vectors of the comments, with $N$ the number of comments,
and $Y = \{Y_1, Y_2, \ldots, Y_N\}$ is the set of label vectors of the
comments, with $Y_i = \{Y_{i1}, \ldots, Y_{i11}\}$ and $Y_{ij} \in \{-1, 1\}$, where
$Y_{ij} = 1$ indicates the presence of the $j$th label for comment
$i$.</p>
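      <p>A minimal sketch of this representation is given below. It makes several simplifying assumptions that are not from the paper: the stop-word list is a tiny illustrative one, stemming is omitted, and the function names and document-frequency thresholds are hypothetical.</p>
      <preformat preformat-type="code">
```python
from collections import Counter

# A tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "is", "my", "and", "to", "this", "it"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words (no stemming here)."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def build_vocabulary(comments, min_df, max_df):
    """Keep words whose document frequency lies between min_df and max_df."""
    df = Counter()
    for text in comments:
        df.update(set(tokenize(text)))
    kept = sorted(w for w, n in df.items() if n >= min_df and max_df >= n)
    return {word: index for index, word in enumerate(kept)}


def bow_vector(text, vocabulary):
    """Count vector X_i for one comment over the fixed vocabulary."""
    vector = [0] * len(vocabulary)
    for tok in tokenize(text):
        if tok in vocabulary:
            vector[vocabulary[tok]] += 1
    return vector


def label_vector(present, all_labels):
    """Label vector Y_i with +1 for present labels and -1 otherwise."""
    return [1 if label in present else -1 for label in all_labels]


comments = ["Spam spam ads everywhere", "This app is spam", "Ads and more ads"]
vocab = build_vocabulary(comments, min_df=2, max_df=3)
X = [bow_vector(c, vocab) for c in comments]
```
      </preformat>
      <p>With these three toy comments only "ads" and "spam" survive the frequency cut, so each comment becomes a two-dimensional count vector.</p>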
    </sec>
    <sec id="sec-6">
      <title>4.2 Independent Logistic Regression (ILR)</title>
      <p>
        Much work has been done on multi-label
learning from different perspectives. Madjarov et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] made an
extensive empirical comparison of different multi-label
learning methods. In that comparison, Binary Relevance (BR) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
performs similarly to the classifier chaining (CC) method [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Although not the best method among all the competitors, BR
shows robust performance and has the advantage of
simplicity. For the task of CSPI detection, this paper does
not seek to investigate or improve multi-label
learning algorithms in general. Therefore, ILR is chosen as the
baseline, which is simply the BR framework with Logistic Regression (LR) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] as the
base classifier. LR is a widely adopted linear classifier owing
to its robustness and simplicity. To apply LR to the
multi-label problem under the assumption of independence between labels, one
LR classifier is trained for each label. The objective
function of ILR with 2-norm regularization for label $j$ is:
$$\min_{w_j, b_j} \; \mathrm{NLL}(X; w_j, b_j) + \lambda \left( \|w_j\|_2^2 + \|b_j\|_2^2 \right)$$
where $\mathrm{NLL}(X; w_j, b_j)$ is the negative log-likelihood
function and $\lambda$ is the regularization parameter, which can be
obtained by cross validation on the training set. This optimization
problem can be solved using gradient descent.
      </p>
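      <p>The baseline can be sketched as one regularized logistic regression per label, fitted by plain batch gradient descent. The sketch below is an assumption-laden illustration rather than the authors' code: the function names, learning rate, number of epochs, value of lam and the toy data are all hypothetical.</p>
      <preformat preformat-type="code">
```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_lr(X, y, lam=0.01, lr=0.5, epochs=500):
    """Fit one L2-regularized logistic regression by batch gradient descent.

    y holds labels in {-1, +1}; the objective is NLL + lam * (|w|^2 + b^2).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [2 * lam * wk for wk in w], 2 * lam * b
        for xi, yi in zip(X, y):
            margin = yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b)
            coeff = -yi * sigmoid(-margin)  # derivative of log(1 + exp(-margin))
            grad_w = [g + coeff * xk for g, xk in zip(grad_w, xi)]
            grad_b += coeff
        # Averaging the gradient over n only rescales the step size.
        w = [wk - lr * g / n for wk, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def train_ilr(X, Y, **kwargs):
    """Binary relevance: train one independent classifier per label column."""
    n_labels = len(Y[0])
    return [train_lr(X, [row[j] for row in Y], **kwargs) for j in range(n_labels)]


def predict_proba(models, x):
    """Per-label probability that the label is present for feature vector x."""
    return [sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b) for w, b in models]


# Toy data: feature 0 counts "ads", feature 1 counts "permission".
X = [[2, 0], [1, 0], [0, 2], [0, 1], [0, 0], [1, 1]]
Y = [[1, -1], [1, -1], [-1, 1], [-1, 1], [-1, -1], [1, 1]]  # columns: Spam, Privacy
models = train_ilr(X, Y)
probs = predict_proba(models, [3, 0])
```
      </preformat>
      <p>The per-label probabilities returned by predict_proba play the role of the first-round outputs that the later post-process uses as features.</p>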
    </sec>
    <sec id="sec-7">
      <title>4.3 Comment Expansion</title>
      <p>
        User comments have some properties that distinguish them from
carefully written documents. They are normally short,
with misspelled words (as mentioned in Section 3.1)
and made-up words (e.g. "spamspamspamspam"). Also,
different words or phrases may be used for the same opinion.
These properties may harm the
performance of CSPI detection, since the features may not fully
represent the opinion of the comment. The situation is
similar to that of user queries in Information Retrieval
(IR), where a query submitted to a search engine
can also be short, misspelled and varied in its choice of
words. Motivated by this, a traditional IR technique
called query expansion with pseudo relevance feedback [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
is borrowed here to perform comment expansion. Originally,
this technique treats the top documents in the ranked retrieval
result for the original query as "relevant"
documents and uses these documents to expand the original
query, generating a new query that resembles the "relevant"
documents. Query expansion is reported to almost always
have a positive effect on retrieval performance [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. A
similar process can be adopted for comment expansion.
The comment expansion has two steps: (1) find "relevant" comments via
retrieval; (2) expand the original comment with the "relevant" comments.
      </p>
      <p>The relevance between comments is evaluated
using the retrieval model. An interesting question is
the scope of the retrieval: should the candidate "relevant"
comments be picked from all other comments, or only from the
comments of the same app or of the same category of apps?
In addition, the "relevant" comments should only be those posted
before the comment being expanded. To avoid issuing a
complicated query to the retrieval engine, the query contains only
the comment being expanded, with no constraints. A sufficiently
long ranked list is returned by the retrieval engine, and
the scopes and restrictions are then applied to this list to pick
the "relevant" comments.</p>
      <p>With a set of "relevant" comments, comment expansion
is carried out at the feature level as a convex combination of the original
comment and the mean of the set of "relevant" comments:
$$f_{new} = (1 - \alpha) f_{old} + \alpha \frac{1}{|R|} \sum_{f \in R} f$$</p>
      <p>where $f_{old}$ is the BOW feature vector of the original
comment, $f_{new}$ the feature vector of the expanded comment,
and $R$ the set of "relevant" comments with respect to the
comment being expanded. $\alpha$ is a tunable parameter that controls
the degree of the expansion; it is tuned in the
experiments for the best performance, as is the size of $R$.
Comment expansion is an efficient way to utilize the
relationship between comments. Other choices, such as kernel methods,
may require computing the similarity between each pair
of comments, so the amount of computation grows
quadratically with the number of comments. Comment expansion,
on the other hand, relies on a retrieval engine, and the cost of
retrieval for one comment is normally roughly constant, hence the
total time cost is linear in the number of comments.</p>
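      <p>The two steps of comment expansion can be sketched end to end as follows. This is a toy illustration under explicit assumptions: comments are small dictionaries with hypothetical fields (id, app, category, time, vec), similarity is a plain cosine over sparse vectors instead of a real retrieval engine, and the symbol alpha for the mixture ratio is assumed.</p>
      <preformat preformat-type="code">
```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def relevant_comments(query, candidates, scope, size):
    """Rank all candidates first, then apply the time and scope restrictions."""
    ranked = sorted(
        (c for c in candidates if c["id"] != query["id"]),
        key=lambda c: cosine(query["vec"], c["vec"]),
        reverse=True,
    )

    def allowed(c):
        if c["time"] >= query["time"]:
            return False  # only comments posted before the query comment
        if scope == "App":
            return c["app"] == query["app"]
        if scope == "Category":
            return c["category"] == query["category"]
        return True  # scope "All"

    return [c for c in ranked if allowed(c)][:size]


def expand_comment(f_old, relevant, alpha):
    """f_new = (1 - alpha) * f_old + alpha * mean of the relevant vectors."""
    if not relevant:
        return dict(f_old)  # nothing to expand with; keep the original features
    terms = set(f_old).union(*relevant)
    mean = {t: sum(f.get(t, 0.0) for f in relevant) / len(relevant) for t in terms}
    return {t: (1 - alpha) * f_old.get(t, 0.0) + alpha * mean[t] for t in terms}


comments = [
    {"id": 1, "app": "a", "category": "game", "time": 1,
     "vec": {"ads": 1.0, "annoying": 1.0}},
    {"id": 2, "app": "b", "category": "game", "time": 2,
     "vec": {"annoying": 1.0}},
    {"id": 3, "app": "a", "category": "game", "time": 5,
     "vec": {"ads": 1.0, "annoying": 1.0}},
    {"id": 4, "app": "a", "category": "game", "time": 3,
     "vec": {"add": 1.0, "annoying": 1.0}},  # "add" is a misspelling of "ads"
]
query = comments[3]
neighbours = relevant_comments(query, comments, scope="App", size=3)
f_new = expand_comment(query["vec"], [c["vec"] for c in neighbours], alpha=0.5)
```
      </preformat>
      <p>After expansion the misspelled comment has a non-zero weight on "ads", which is the effect the expansion is meant to achieve; comment 3 is never used because it was posted after the query comment.</p>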
    </sec>
    <sec id="sec-8">
      <title>4.4 Post-Process with Label Correlation</title>
      <p>
        Label correlation is used in various ways in multi-label
learning [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. As mentioned in Section 3.1, labels rarely appear
alone. Hence the correlation between labels can be a
valuable source of information. For the CSPI detection task,
a simple post-process is applied to embed this information.
The post-process uses a second round of ILR with different
features for the comments. The probability outputs $\hat{Y}$ of the
first-round ILR classifiers are used as features in this second
round of ILR to predict $Y$. The correlation matrix of
the labels is incorporated through a Laplacian norm [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] in these LR
problems. Let $\hat{Y} = \{\hat{Y}_1, \ldots, \hat{Y}_N\}$ be the set of estimated label
vectors from the first-round ILR. The objective function of the second-round
LR for label $j$ is:
$$\min_{\hat{w}_j, \hat{b}_j} \; \mathrm{NLL}(\hat{Y}; \hat{w}_j, \hat{b}_j) + \hat{\lambda} \left( \hat{w}_j^{T} L \hat{w}_j + \|\hat{b}_j\|_2^2 \right)$$
where $L = D - A$ is the Laplacian matrix, with $A$ the
correlation matrix of the labels $Y$ in the training set and $D$ a diagonal
matrix whose diagonal elements are the row sums of $A$.
These optimization problems can likewise be solved using
gradient descent.
      </p>
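      <p>Because the second-round features are the first-round label probabilities, each weight corresponds to one label, and the Laplacian term penalizes different weights on strongly correlated labels. The sketch below illustrates only this regularizer, assuming the usual graph Laplacian (degree matrix minus correlation matrix); the function names and the toy matrix with non-negative entries are hypothetical.</p>
      <preformat preformat-type="code">
```python
def laplacian(A):
    """Graph Laplacian L = D - A, with D the diagonal matrix of row sums of A."""
    n = len(A)
    return [
        [(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)]
        for i in range(n)
    ]


def laplacian_penalty(w, L):
    """Quadratic form w^T L w used as the regularizer on the weights."""
    n = len(w)
    return sum(w[i] * L[i][j] * w[j] for i in range(n) for j in range(n))


# Toy symmetric correlation matrix for three labels (zero diagonal assumed).
A = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]
L = laplacian(A)
smooth = laplacian_penalty([1.0, 1.0, 0.0], L)  # equal weights on correlated labels
rough = laplacian_penalty([1.0, -1.0, 0.0], L)  # opposite weights on correlated labels
```
      </preformat>
      <p>For a symmetric matrix the quadratic form equals the sum over label pairs of the correlation times the squared weight difference, so the "rough" weight vector is penalized far more than the "smooth" one.</p>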
    </sec>
    <sec id="sec-9">
      <title>5. EXPERIMENTS</title>
    </sec>
    <sec id="sec-10">
      <title>5.1 Experiment setting</title>
      <p>
        The proposed CSPI detection method is evaluated
on the suspect set of comments. As a
supervised method, it requires a training set. The suspect set
is split 50/50 into a
training set and a testing set. The split is made at the app
level, so all comments for the same app fall in either the
training set or the testing set. The BOW feature vector has
length 13,135 after removing words with fewer than 100 or
more than 1,000,000 appearances (in number of comments)
in the training set. Considering the hierarchy in the label set, a
comment labeled with a sub-label is also treated as a
positive sample for its super-label, while a sample labeled
only with a super-label is treated as neither a positive
nor a negative sample for its sub-labels. For the baseline ILR,
5-fold cross validation on the training set is used to find the regularization parameters $\lambda$,
and the L-BFGS quasi-Newton method [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is applied to solve
the optimization problems. For comment expansion, tf-idf
features with cosine similarity are adopted as the retrieval
model, and Lemur<sup>1</sup> is used as the retrieval tool. A time
constraint is enforced to prevent comment expansion with
"future relevant" comments. A set constraint is applied to
allow expansion only within the training set or the testing set respectively, and
the indexes of the retrieval model are built for the two sets
separately so that model parameters such as the number of documents
and the IDF values do not leak between the sets. Three types of
scope (All, App and Category) are tested for each label, and
5-fold cross validation on the training set is used to pick the best of the three
for each label. The mixture ratio $\alpha$ and
the size $|R|$ of the "relevant" document set for the expansion are
also fixed by the 5-fold cross validation on the training set for
each label, along with the scope. The size $|R|$ is selected
from {1, 3, 5, 10}. For the post-process with label correlation,
the optimization is solved in the same way as for the baseline ILR.
The evaluation metric is the micro F1 value. The F1 value is the
harmonic mean of Precision and Recall, and micro F1 is
computed across all samples and all labels at once.
      </p>
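      <p>Two parts of this setting are easy to get wrong and are sketched below: the app-level split and the micro-averaged F1. The sketch is illustrative only; the function names, the dictionary layout of a comment and the random seed are assumptions, and labels are encoded as +1 and -1 as in Section 4.1.</p>
      <preformat preformat-type="code">
```python
import random


def split_by_app(comments, seed=0):
    """50/50 split at the app level so that an app's comments stay together."""
    apps = sorted({c["app"] for c in comments})
    random.Random(seed).shuffle(apps)
    train_apps = set(apps[: len(apps) // 2])
    train = [c for c in comments if c["app"] in train_apps]
    test = [c for c in comments if c["app"] not in train_apps]
    return train, test


def micro_f1(Y_true, Y_pred):
    """Micro-averaged F1 over all samples and labels; entries are +1 or -1."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(Y_true, Y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1:
                fp += 1
            elif t == 1:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


comments = [{"app": name, "text": "..."} for name in ["a", "a", "b", "c", "c", "d"]]
train, test = split_by_app(comments)

Y_true = [[1, -1, 1], [-1, -1, 1], [1, 1, -1]]
Y_pred = [[1, -1, -1], [-1, 1, 1], [1, 1, -1]]
score = micro_f1(Y_true, Y_pred)  # 4 true positives, 1 false positive, 1 false negative
```
      </preformat>
      <p>Pooling the counts over all labels before computing F1 means that frequent labels such as General weigh more heavily in the score than rare ones such as After.</p>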
    </sec>
    <sec id="sec-11">
      <title>5.2 Results and Analysis</title>
      <p>The comparison between the proposed CDCE method and
the baseline in F1 value is shown in Table 3.</p>
      <p>Three variants of the proposed method are compared. The
expansion-only CDCE variant uses comment expansion without the post-process. The CDCE+
variant uses comment expansion and the
post-process on all labels. The selective CDCE variant
picks, for each label, between the expansion-only model and the
CDCE+ model by cross-validation. This evaluation
is based on 10 different training/testing splits, and
the reported micro F1 values are the means over these ten
splits.</p>
      <p>[Table 3: micro F1 by method; rows: ILR, CDCE (expansion only), CDCE+, CDCE (per-label selection).]</p>
      <p><sup>1</sup> http://www.lemurproject.org/</p>
      <p>Comparing CDCE with ILR shows a clear overall improvement
from using comment expansion. The expansion
makes comments "smoother" among similar comments at the
feature level, and improves the performance of the
classifiers on short comments and on those with misspelled
words. For example, "ads" is sometimes misspelled as
"add"; the expansion adds the features of similar comments to
the features of the comment being expanded, so the feature dimension
for "ads" may no longer be zero, and the classifier
can capture this feature and tend to put the label Spam
on the comment. On the other hand, the effect of
expansion is not positive for all samples. For example, among
comments about a Game app, many may be talking
about "money" because the users paid for some item in the game
but never got it. These comments should be labeled
Finance. But one comment for the same app may just be
talking about how expensive that item is, and hence that it is not worth
the "money". This comment is not a CSPI. After
expansion, however, it will look much like the others and hence
be labeled Finance as well. A negative effect
of comment expansion is therefore that it may silence
voices that make a different point while using words similar
to those of other comments. Nevertheless, the overall effect of
comment expansion is clearly positive.</p>
      <p>The difference between the expansion-only CDCE and CDCE+ shows that
the post-process does not guarantee an improvement for all
labels. Hence the selective CDCE method is proposed, which picks the better
model between the expansion-only CDCE and CDCE+ for each label on the
training set. It provides the best performance among
the compared methods.</p>
      <p>A label-level comparison between CDCE and ILR can be
found in Table 4, where CDCE outperforms ILR
for all labels except After. The model behavior for the label
After differs from the others mostly because of the lack of
samples. There are only 36 samples to be split into training and
testing sets, which makes the training of the model
insufficient and means that comment expansion mostly uses
comments with different labels as "relevant" comments. Besides
After, three labels (Before, System and Others) do not pass
the statistical significance test at the 0.05 level, with p-values of
8.2%, 5.5% and 8.1% respectively. For the remaining labels, the
highest p-value is 0.9%, for Finance.</p>
      <p>The results of picking the better model between CDCE and
CDCE+ are also shown in the P.P. column of Table 4. It
appears that none of the Nature labels are suitable for the
post-process, while the Scenario labels tend to get a boost, based on the
results in Table 3. The post-process uses the ILR
results as features to predict a label. This prediction may
be improved by correlation information from other
labels, but it may also suffer from poorly predicted ILR
results. The important words (features) for the Nature labels are
not as diverse as those for the Scenario labels. For example, for the
label Spam, no matter which Scenario label comes with it, the
comment will probably still use words like "spam" or
"ads". Similarly, for Privacy there are "privacy", "invade"
and "permission". To distinguish Scenario labels, however,
different words are important depending on the
accompanying Nature label. Given Spam, Foreground is related to
over-sized on-screen ads, and Background most likely to
notification bar spamming. Given Privacy, on the other hand,
Foreground may be related to phishing and Background to
the stealing or abuse of personal information. Hence the
correlation information may be much more helpful for Scenario
labels than for Nature labels.</p>
      <table-wrap id="tab-4">
        <label>Table 4</label>
        <caption>
          <p>Post-process selection (P.P.) and retrieval scope (Type) chosen for each label.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Label</th><th>P.P.</th><th>Type</th></tr>
          </thead>
          <tbody>
            <tr><td>General</td><td>No</td><td>App</td></tr>
            <tr><td>Before</td><td>Yes</td><td>Cat.</td></tr>
            <tr><td>Execution</td><td>Yes</td><td>App</td></tr>
            <tr><td>Foreground</td><td>Yes</td><td>Cat.</td></tr>
            <tr><td>Background</td><td>No</td><td>App</td></tr>
            <tr><td>After</td><td>No</td><td>App</td></tr>
            <tr><td>System</td><td>No</td><td>App</td></tr>
            <tr><td>Privacy</td><td>No</td><td>Cat.</td></tr>
            <tr><td>Spam</td><td>No</td><td>Cat.</td></tr>
            <tr><td>Finance</td><td>No</td><td>Cat.</td></tr>
            <tr><td>Others</td><td>No</td><td>App</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The scopes of retrieval in comment expansion for each label
are listed in the Type column of Table 4. None of the
classifiers chooses All comments as candidates for "relevant"
comments, while the scopes using the comments of the same
app (App) or of the same category of apps (Cat.) are both popular. The
scope All makes the "relevant" comments too
diverse and noisy, and the expansion then often leads to
unexpected results. The App and Cat. scopes, on the other hand,
serve well, giving a much higher probability of getting
"relevant" comments with both similar text content and
similar topics of issues.</p>
    </sec>
    <sec id="sec-12">
      <title>6. LIMITATION &amp; FUTURE WORK</title>
      <p>As a comment-level analysis, one limitation of this work is
that it does not provide a risk assessment at the app level.
How to evaluate an app's security/privacy risk based on the
identification of CSPI would be interesting future
work. Another limitation comes from the coarse filtering:
CSPI that do not contain any of the keywords are left
out by the method. This may be due to the variety of language
itself, or simply to misspellings or the use of made-up words instead
of the keywords. Further research may therefore address how to
expand the suspect set as a trade-off between computational
load and detection performance.</p>
    </sec>
    <sec id="sec-13">
      <title>7. CONCLUSION</title>
      <p>In this paper, a supervised learning method is proposed to
detect CSPI for Android apps. The task is formalized as
a multi-label learning problem with a two-dimensional
label system that captures the "What" and "When" of the issues
reported in CSPI. A coarse filtering is first applied to
narrow the set of comments down to a suspect set. Comment
expansion is then adopted to make the features more
representative by forming a convex combination of the original features
with those of "relevant" comments. Finally, a post-process is
applied to some of the labels to exploit the label
correlation for further improvement. Experimental results on the
collected dataset show a statistically significant overall improvement
over ILR as a baseline method.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Bishop</surname>
          </string-name>
          .
          <article-title>Pattern recognition and machine learning</article-title>
          , volume
          <volume>1</volume>
          . Springer New York,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Latent Dirichlet allocation</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>3</volume>
          :
          <fpage>993</fpage>
          –
          <lpage>1022</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Boyd</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Vandenberghe</surname>
          </string-name>
          . Convex optimization. Cambridge University Press,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>AR-Miner: Mining informative reviews for developers from mobile app marketplace</article-title>
          .
          <source>International Conference on Software Engineering</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Faloutsos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Sadeh</surname>
          </string-name>
          .
          <article-title>Why people hate your app: making sense of user feedback in a mobile app store</article-title>
          .
          <source>In Proceedings of the 19th ACM SIGKDD</source>
          , pages
          <fpage>1276</fpage>
          –
          <lpage>1284</lpage>
          . ACM,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L. V.</given-names>
            <surname>Galvis Carreño</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Winbladh</surname>
          </string-name>
          .
          <article-title>Analysis of user comments: an approach for software requirements evolution</article-title>
          .
          <source>In Proceedings of the 2013 International Conference on Software Engineering</source>
          , pages
          <fpage>582</fpage>
          –
          <lpage>591</lpage>
          . IEEE Press,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Madjarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kocev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gjorgjevikj</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Džeroski</surname>
          </string-name>
          .
          <article-title>An extensive experimental comparison of methods for multi-label learning</article-title>
          .
          <source>Pattern Recognition</source>
          ,
          <volume>45</volume>
          (
          <issue>9</issue>
          ):
          <fpage>3084</fpage>
          –
          <lpage>3104</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Quanz</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Huan</surname>
          </string-name>
          .
          <article-title>Aligned graph classification with regularized logistic regression</article-title>
          .
          <source>In SDM</source>
          , pages
          <fpage>353</fpage>
          –
          <lpage>364</lpage>
          .
          <publisher-name>SIAM</publisher-name>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Read</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Holmes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Frank</surname>
          </string-name>
          .
          <article-title>Classifier chains for multi-label classification</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>85</volume>
          (
          <issue>3</issue>
          ):
          <fpage>333</fpage>
          –
          <lpage>359</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H. T. T.</given-names>
            <surname>Truong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lagerspetz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nurmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Oliner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tarkoma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Asokan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          .
          <article-title>The company you keep: Mobile malware infection rates and inexpensive risk indicators</article-title>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tsoumakas</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Katakis</surname>
          </string-name>
          .
          <article-title>Multi-label classification: An overview</article-title>
          .
          <source>International Journal of Data Warehousing and Mining (IJDWM)</source>
          ,
          <volume>3</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1</fpage>
          –
          <lpage>13</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Query expansion using local and global document analysis</article-title>
          .
          <source>In Proceedings of the 19th ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96</source>
          , pages
          <fpage>4</fpage>
          –
          <lpage>11</lpage>
          , New York, NY, USA,
          <year>1996</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>