Explaining Unwanted Behaviours in Context

Wei Chen (University of Edinburgh, wchen2@inf.ed.ac.uk)
David Aspinall (University of Edinburgh, David.Aspinall@ed.ac.uk)
Andrew D. Gordon (Microsoft Research Cambridge and University of Edinburgh, Andy.Gordon@ed.ac.uk)
Charles Sutton (University of Edinburgh, csutton@inf.ed.ac.uk)
Igor Muttik (Intel Security, igor.muttik@intel.com)

Abstract

Mobile malware has increasingly been identified based on unwanted behaviours, such as sending premium SMS messages. However, behaviours that are unwanted in one group of apps can be normal in another, i.e., they are context-sensitive. We develop an approach to automatically explain unwanted behaviours in context and evaluate the automatic explanations via a user study, with favourable results. These explanations not only state whether an app is malware but also elaborate how, and in what kind of context, a decision was made.

1 Introduction

Researchers and malware analysts have identified hundreds of thousands of mobile apps as malware [EOMC11, ZJ12] and organised them into families based on unwanted behaviours, e.g., stealing personal information, accessing locations, collecting contact information, constantly sending premium messages, etc. However, apart from malware analysis reports on a few famous malware families [ZJ12, S+13], e.g., Geinimi, Basebridge, Spitmo, Zitmo, Ginmaster, Ggtracker, Droidkungfu, etc., people do not know what kind of behaviour makes a mobile app bad. This suggests a research problem: automatically producing a short paragraph that explains unwanted behaviours.

A naive method is to: train a linear classifier on a collection of identified malware instances and benign apps, choose the features with the top weights assigned by this classifier, then process the selected features through templates to output text. This method has been adopted in research, e.g., by the Drebin system [A+14a].

However, by greedily choosing features to output text, the generated explanations are inaccurate. This is mainly because unwanted behaviours of mobile apps are context-sensitive, i.e., a behaviour that is unwanted in one group of apps can be normal in another. For example, collecting locations is normal for jogging-tracker apps, but unwanted for card-game apps.

Instead, our new approach is to organise sample apps into fine-grained groups by their behavioural similarity. We set the context of an app in question to the group whose members' behaviours are most similar to this app's behaviours. By exploiting the behavioural difference between malware and benign apps in this context, we decide whether the target app is malware, and if so, we produce an explanation. Here are two example automatic explanations.

a. This app is like a chatting app, but, after a USB mass storage is connected, it will: retrieve a class in a runnable package; read information about networks; connect to the Internet.

b. This app is like an anti-virus app, but it will: read your phone state after a phone call is made; read your phone state then connect to the Internet; send SMS messages after a phone call is made.

These explanations not only elaborate which behaviour is unwanted but also give the context, e.g., chatting and anti-virus, in which a decision was made. Here, the context name comes from the known category names of the apps in the same group, by majority voting.

Copyright © by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: D. Aspinall, L. Cavallaro, M. N. Seghir, M. Volkamer (eds.): Proceedings of the Workshop on Innovations in Mobile Privacy and Security IMPS at ESSoS'16, London, UK, 06-April-2016, published at http://ceur-ws.org
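The naive baseline criticised above can be sketched in a few lines: fit a linear scorer, take the top-weighted features present in the target app, and render them through text templates. The feature names, templates, toy data, and the crude frequency-difference "classifier" below are illustrative assumptions, not the pipeline of Drebin or of this paper.

```python
# Sketch of the naive, context-free explanation baseline: a linear
# scorer's top-weighted features are templated into text. Feature
# names, templates, and toy data are illustrative assumptions.
FEATURES = ["SEND_SMS", "READ_PHONE_STATE", "ACCESS_FINE_LOCATION"]
TEMPLATES = {
    "SEND_SMS": "it may send SMS messages",
    "READ_PHONE_STATE": "it may read your phone state",
    "ACCESS_FINE_LOCATION": "it may access your precise location",
}

# Toy training set: (binary feature vector, label), 1 = malware.
train = [([1, 1, 0], 1), ([1, 0, 1], 1), ([0, 0, 1], 0), ([0, 1, 0], 0)]

def fit(samples):
    """A crude linear 'classifier': each weight is the frequency
    difference of the feature between malware and benign apps,
    standing in for learned classifier weights."""
    weights = [0.0] * len(FEATURES)
    for vec, label in samples:
        for i, v in enumerate(vec):
            weights[i] += v if label == 1 else -v
    return weights

def explain(weights, app_vector, top_k=2):
    """Greedily template the top-k positively weighted features
    that are present in the target app."""
    present = [i for i, v in enumerate(app_vector) if v and weights[i] > 0]
    present.sort(key=lambda i: weights[i], reverse=True)
    return [TEMPLATES[FEATURES[i]] for i in present[:top_k]]

w = fit(train)
print(explain(w, [1, 1, 0]))  # greedy, context-free explanation
```

Because the weights are global, the same feature produces the same text for every app, which is exactly the context-insensitivity this paper sets out to fix.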
Our approach combines static analysis, clustering, supervised learning, and text-mining techniques, and proceeds as follows.

• Formalisation. We approximate the behaviour of an Android app by an extended call-graph, i.e., a collection of finite control-sequences of events, actions, and annotated API calls. From this graph we extract happen-before features, each denoting that something happens before something else.

• Learning. We organise sample apps into groups using clustering methods, and characterise the unwanted behaviours of each group by exploring the difference between malware instances and benign apps within the same group.

• Explanation. We decide whether a target app is malware by choosing a group and then checking, against this group, whether the app has any unwanted behaviour, i.e., a behaviour exhibited by a malware instance in the group. The corresponding features are fed through hand-built templates to produce text as explanations.

The main contributions of this paper are to:

- show that the happen-before feature is an appropriate abstraction of app behaviours with respect to learning and explaining;

- introduce context into behaviour explanations and develop a clustering-based algorithm (Figure 2) to organise sample apps into groups and construct unwanted behaviours for each group;

- demonstrate, by surveying general users, that automatic explanation in context produces more convincing and desirable results than several other candidate methods.

1.1 Related Work

To automatically detect Android malware, machine learning methods have been applied to train classifiers [ADY13, GYAR13, GTGZ14, YSMM13]. All of them aimed to obtain good fits to the training data by trying different methods and features. Explanations of the chosen features have received much less consideration.

The tool Drebin [A+14a] is the first attempt to automatically generate explanations of Android malware. It generates explanations by choosing features with top weights from a linear classifier, then processing them through hand-built templates to output text. A broad range of syntax-based features, e.g., permissions, API calls, intents, URLs, etc., were collected for training. A recent prototype, DescribeMe [ZDFY15], generates text from data-flows by feeding features through hand-built templates. Its main drawback is scalability: producing data-flows is too expensive for most apps.

Our notion of context is similar to the clusters used in the tool CHABADA [GTGZ14]. This tool detects outliers (abnormal API usage) within clusters of apps by using a one-class SVM (OC-SVM). The clusters were formed from the descriptions of apps using Latent Dirichlet Allocation (LDA). However, for most of our sample apps, which were collected from alternative Android markets, e.g., Wandoujia, Baidu, and Tencent in China, it is hard to obtain descriptions, and these descriptions are often written in different languages.

The extended call-graphs are much more accurate than the manifest information, e.g., permissions and actions, which has often been used as input features for malware detection or mitigation [BKvOS10, EOM09, F+11]. Compared with a simple list of API calls appearing in the code, the extended call-graph can capture more sophisticated behaviours. This is needed in practice, because API calls appearing in the code contain "noise" caused by dead code and libraries [ADY13], and some unwanted behaviours only arise when API methods are called in certain orders [C+13, KB15, Y+14]. On the other hand, call-graphs are less accurate than models which capture data-flows. But it is much easier to generate extended call-graphs for apps en masse using our tool than to generate data-flows using tools like FlowDroid [A+14b] or Amandroid [WROR14]. In particular, people can annotate the API methods of interest to generate compact graphs more efficiently, rather than considering all data-dependences between statements.

2 Characterising App Behaviours

We use a simplified synthetic example to illustrate the characterisation of app behaviours. It is an Android app which constantly sends out the device ID and the phone number by SMS messages in the background when an incoming SMS message is received.

We approximate its behaviour by the graph in Figure 1. It tells us: this app has two entries, respectively specified by the actions MAIN and SMS_RECEIVED; it will collect the device ID and the phone number in a Broadcast Receiver, then send SMS messages out in an AsyncTask; the behaviour of sending SMS messages can also be triggered by an interaction from the user, e.g., clicking a button, touching the screen, long-pressing a picture, etc., which is denoted by the word "click".

[Figure 1: An example extended call-graph, with entry actions MAIN and SMS_RECEIVED, a user event "click", and annotated API calls Receiver:getDeviceId, Receiver:getLine1Number, and AsyncTask:sendTextMessage.]

This graph is a collection of finite control-sequences of actions, events, and annotated API calls, constructed from the bytecode of an Android app. Actions reflect what happens in the environment and what kind of service an app requests, e.g., an incoming message is received, the device finishes booting, the app wants to send an email by using the service supplied by an email client, etc. Events denote interactions from the user, e.g., clicking a picture, pressing a button, scrolling down the screen, etc. Annotated API calls tell us whether the app does anything we are interested in. For instance, getDeviceId, getLine1Number, and sendTextMessage are annotated API calls in the above example.

To construct such a graph directly from the bytecode, we have to model complex real-world features of the Android framework, including: inter-procedural calls, callbacks, component life-cycles, permissions, actions, events, inter-component communications, multiple threads, multiple entries, interfaces, nested classes, and runtime-registered listeners. We do not model registers, fields, assignments, operators, pointer aliases, arrays, or exceptions. The choice of which aspects to model is a trade-off between efficiency and precision.

In our implementation, we use an extension of the permission-governed API methods generated by PScout [AZHL12] as annotations. The Android platform tools aapt and dexdump are respectively used to extract the manifest information and to decompile the bytecode into assembly code, from which we construct the extended call-graph.

Once the extended call-graphs are constructed, we can extract features for the purpose of learning unwanted behaviours. In particular, we extract pairs of edge labels occurring in sequence, i.e., denoting that something happens before something else, so-called happen-befores. In general, one can extract n-tuples, but in practice we found that constructing triples was already too expensive: the average number of triples in a typical extended call-graph is of the order of 10^4.

3 Learning Unwanted Behaviours

A behaviour that is unwanted for one kind of app can be innocuous for another. For example, sending SMS messages is normal for messaging apps, but unwanted for an e-reader app; a user might expect that a weather forecast app accesses his or her locations, but might feel uncomfortable if a messaging app does so. Therefore, to understand and explain unwanted behaviour, we need a notion of context.

3.1 Constructing Context

Unwanted behaviours in general account for only a small part of a malicious app's activities. This is by design: malicious apps seek to hide their bad behaviours, and are often constructed by repackaging benign applications [Z+14, Z+13]. This observation gives us a notion of context: we group together apps, benign or malicious, whose behaviours are mostly the same. Then, within the context, we distinguish unwanted from normal behaviours by exploring features which are mostly associated with malware. This produces a fine-grained, behavioural notion of context that is more discriminating than categories, e.g., GAME, TOOLS, WEATHER, etc., or clusters produced from developer-written textual descriptions [GTGZ14].

We formalise this idea in Figure 2. Sample apps are organised into groups. Apps in the same group share common behaviours, in the sense that their feature vectors are similar. Ideally, repackaged apps will end up in the same group as the original benign apps. In practice, a group might consist of only benign apps or only malware; this depends on the feature used for clustering and its distribution in the sample apps.

Two sets of features are constructed for each group: normal and unwanted. The normal set is the union of all behaviours of the benign apps. The unwanted set consists of the abnormal behaviours of the malware, that is, the relative complement of the normal set in the collection of behaviours of the malware instances.

The rule behind this construction is: a benign app cannot have any unwanted behaviour, and a malware instance must have some unwanted behaviour, whatever its other behaviours are. Every sample app in the same group is required to follow this rule; otherwise, there is a conflict in the group. To resolve a conflict, we split the group into two disjoint subgroups. The above construction is then repeated on the subgroups until all conflicts are resolved.
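The happen-before extraction described in Section 2 can be sketched as follows: given an extended call-graph as a set of labelled edges, collect every pair consisting of an incoming edge label followed by an outgoing edge label at the same node. The graph below loosely models the Figure 1 example; the node names and edge encoding are illustrative assumptions, not the tool's actual representation.

```python
# A sketch of happen-before feature extraction from an extended
# call-graph given as labelled edges (source, label, target).
# Node names and labels are illustrative; the pair-extraction
# idea is the one described in Section 2.
from collections import defaultdict

edges = [
    ("entry", "SMS_RECEIVED", "receiver"),
    ("receiver", "Receiver:getDeviceId", "collected"),
    ("collected", "Receiver:getLine1Number", "ready"),
    ("ready", "AsyncTask:sendTextMessage", "sent"),
]

def happen_befores(edges):
    """Collect pairs of edge labels occurring in sequence: the label
    of an edge into a node followed by the label of an edge out of
    that node."""
    outgoing = defaultdict(list)
    for src, label, dst in edges:
        outgoing[src].append(label)
    pairs = set()
    for src, label, dst in edges:
        for next_label in outgoing[dst]:
            pairs.add((label, next_label))
    return pairs

print(sorted(happen_befores(edges)))
```

On this toy graph the extraction yields pairs such as (SMS_RECEIVED, Receiver:getDeviceId), i.e., "after an SMS is received, the device ID is read" — exactly the kind of ordering information a flat list of API calls would lose.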
Function construct_context(group)
  Input: a group of malware and benign applications
  Output: fine-grained groups with normal and unwanted features
  G ← {group}; P ← {}
  has_conflict ← True
  while has_conflict do
    has_conflict ← False
    for group in G do
      normal, unwanted ← collect_behaviour(group)
      if detect_conflict(group, normal, unwanted) then
        group_a, group_b ← split_group(group)
        G ← (G − {group}) ∪ {group_a, group_b}
        has_conflict ← True
      else
        G ← G − {group}
        P ← P ∪ {(group, normal, unwanted)}
      end if
    end for
  end while
  return P

Function collect_behaviour(group)
  normal ← {}; unwanted ← {}
  for app in group do
    if app is benign then
      normal ← normal ∪ feature(app)
      unwanted ← unwanted − normal
    else
      unwanted ← (unwanted ∪ feature(app)) − normal
    end if
  end for
  return normal, unwanted

Function detect_conflict(group, normal, unwanted)
  for app in group do
    if app is benign and feature(app) ⊈ normal then return True
    if app is malicious and feature(app) ∩ unwanted = ∅ then return True
  end for
  return False

Figure 2: Context and unwanted behaviours.

The process starts with the function construct_context, which is invoked on the whole collection of sample apps. When the algorithm terminates, the following property is satisfied: for each app in a group, if it is malware then feature(app) ∩ unwanted ≠ ∅; if it is benign then feature(app) ⊆ normal.

The function split_group splits a group of apps into two disjoint subgroups. Many implementations are possible. We adopt hierarchical clustering to group apps: the cosine dissimilarity between feature vectors is calculated, and average-linkage is used to compute the distances between clusters.

To illustrate the notion of context, we constructed unwanted behaviours for 400 randomly chosen sample apps using the above method. The ten biggest generated groups are given in Table 1.

Group  Size  %Malware  #Normal  #Unwanted  Top Malware Family     Top Category
0      163   93.25     6825     36813      Geinimi, Fakerun       ENTERTAINMENT, PERSONALIZATION
5      24    100.0     0        2611       Basebridge, Spitmo     COMMUNICATION, MUSIC AND AUDIO
21     23    39.13     3306     734        Plankton, Droidkungfu  WEATHER, PHOTOGRAPHY
7      17    41.18     1466     295        unknown                WEATHER, TRANSPORTATION
19     13    15.38     1396     77         Adrd                   COMMUNICATION, TOOLS
12     10    10.0      2027     39         Adrd                   MUSIC AND AUDIO, NEWS AND MAGAZINES
25     8     0.0       227      441        -                      WEATHER, BOOKS AND REFERENCE
4      7     85.71     497      584        unknown                GAME STRATEGY
15     7     85.71     20       2907       unknown                TRAVEL AND LOCAL, WEATHER
6      5     40.0      764      125        unknown                PRODUCTIVITY, NEWS AND MAGAZINES

Table 1: Statistics of context and unwanted behaviours for 400 sample apps.

3.2 Classification

We want to decide whether an app in question is malware by using the constructed context and unwanted behaviours. The size and the proportion of malware vary greatly across groups, as shown in Table 1. This results in a difficulty: it is hard to train a classifier for each group using classical learning methods, e.g., SVM, naive Bayes, or logistic regression. Therefore, we calculate the distances between the target app and each group. The closest group is chosen as the context. Then, we decide whether the target app is malware by applying the following logic rules:

• Conservatively normal. The target app is classified as benign if it has no unwanted behaviour and all its behaviours are normal, i.e., feature(app) ⊆ normal.

• Aggressively malicious. The target app is classified as malicious if one of its behaviours is unwanted, i.e., feature(app) ∩ unwanted ≠ ∅.

• Neutrally suspicious. If the target app has no unwanted behaviour but some of its behaviours are not normal, we consider its abnormal behaviours, i.e., feature(app) − normal, as suspicious, and label the app as unknown. That is, according to current knowledge we cannot decide whether it is malware. The decision has to be postponed until more sample apps of this group are acquired.

We randomly chose 1,000 apps, half benign and half malicious, as the training set, and an equal number of apps as the testing set. They contain some famous benign apps, e.g., Google Talk, Amazon Kindle, Youtube, Facebook, etc., and some instances of famous malware families, e.g., DroidKungfu, Plankton, Zitmo, etc. These apps spread across around 30 categories, from ARCADE GAME to WEATHER. Many advertisement libraries were also found in these apps, e.g., Admob, Millennial Media, Airpush, etc.

To compare our classification method with general classifiers, we train a classifier using liblinear [F+08], an implementation of L1-regularized logistic regression [Tib94] (abbreviated as L1LR). We apply our method to construct context and collect unwanted behaviours from the happen-befores extracted from the extended call-graphs of the apps in the training set. Further, we apply the logic rules discussed earlier to decide whether a target app is malware, against the unwanted behaviours of a chosen group.

We report the classification performance as follows.

             Edge Labels in Graphs      Happen-Befores
Classifier   Precision    Recall        Precision    Recall
context      83%          88%           80%          92%
L1LR         83%          89%           85%          88%

It shows that, for both kinds of features, the classification performance of our method is only slightly worse than L1LR, with no more than a 5% drop in precision. This is because some apps are labelled as unknown by our method. We could achieve better classification performance by adding syntax-based features, e.g., permissions and API calls, as input features. However, our goal is to develop a classification method whose output yields better explanations. Since happen-befores can capture more sophisticated app behaviours, we prefer to use unwanted behaviours selected from happen-befores for explanation generation.

4 Generating Explanations

In the classification against the context, the features in the intersection between the unwanted behaviours of a context and the behaviours of a target app are responsible for a decision; we call these salient features. For a training app in a decision context, if one of its behaviours is salient, then this app is a supporting app for this decision. In this section, we exploit salient features and their supporting apps to generate an explanation for a target app. It explains how, and in what kind of context, a decision was made. We want to use these automatic explanations to convince people of the system's automatic decision. Here is an example automatic explanation.

—————————————————————————————
com.keji.danti590 (v3.0.8)
This application is malware. Its malicious behaviours are:

  after a USB mass storage is connected,
  it gets the superclass of a class in a runnable package
  it retrieves classes in a runnable package
  it reads information about networks
  it connects to Internet
  it reads your phone state then connects to Internet

The supporting apps of this explanation are:
  com.keji.danti607 (v3.0.8) (TROJAN)
  com.jjdd (v1.3.1) (MALWARE)
  com.keji.danti562 (v3.0.8) (TROJAN)
  com.keji.danti599 (v3.0.8) (TROJAN)
—————————————————————————————

It not only shows the decision (malware or benign) but also elaborates the most unwanted behaviours. A collection of supporting apps is displayed as well.

Before presenting the technical details, let us look at some salient features:

a. (Object:ConnectivityManager.getActiveNetworkInfo, Runnable:URL.openConnection)
b. (Activity:WifiManager.isWifiEnabled, Activity:WebView.loadUrl)
c. (Object:WebView.loadUrl, Runnable:WifiInfo.getMacAddress)
d. (AsyncTask:DefaultHttpClient.execute, Runnable:URL.openConnection)
e. (Object:WebView.loadData, Runnable:TelephonyManager.getDeviceId)
f. (AsyncTask:NotificationManager.notify, Object:LocationManager.getLastKnownLocation)

They are pairs extracted from the extended call-graphs of the apps in question. Some of them are trivial: e.g., the behaviour "access networks state then connect to Internet", supported by the feature (Object:ConnectivityManager.getActiveNetworkInfo, Runnable:URL.openConnection), appears in almost every app. Some of them are similar: e.g., to capture the behaviour "connect to Internet", the features URLConnection.openConnection and DefaultHttpClient.execute are considered repeated features. This redundancy further clutters the final explanation.

Based on these observations, we generate explanations as follows: map the salient features into simple phrases, process the simple phrases through templates to output compound phrases, then select the most representative compound phrases to present.

First, for each permission, action, event, and each API call not governed by any permission, a phrase is assigned to describe its function. These phrases were extracted from their brief documentation on Android Developers. Second, for the permission-governed API calls, we look up their corresponding permissions and use the phrases for those permissions. Third, for pair features, we combine the phrases for their coordinates to form compound phrases. The templates used in explanations are listed in Table 2. This step aggregates features to reduce redundancy.

Feature Type              Template                             Example
permission                request the permission to do sth.    request the permission to change Wi-Fi connectivity state
API call                  might invoke the API: API name       might invoke the API: android.content.Intent.
annotation                do sth.                              read your phone state
action                    sth. happens                         the app has finished booting
event                     the user does sth.                   the user clicks a view and holds
(annotation, annotation)  do sth. then do sth.                 read your phone state then connect to Internet
(annotation, action)      do sth. then sth. happens            read SMS then the app makes a phone call
(action, annotation)      after sth. happens do sth.           after the system has finished booting read your phone state
(event, annotation)       when the user does sth. do sth.      when the user touches the screen get your precise location
(event, action)           when the user does sth. sth. happens when the user performs a gesture the app sends some data to someone elsewhere

Table 2: Templates for the explanation generation.

By using the above method, for each supporting app we get a collection of phrases with their appearance frequencies in this app. We rank the phrases of each supporting app using TF-IDF (term frequency - inverse document frequency) and choose the top-m phrases as representatives. Then, we apply DF (document frequency) to rank the representatives of the supporting apps and choose the top-n phrases to present. We use the formulae

  (0.5 + 0.5 × f(t,d) / max{f(t,d) | t ∈ d}) × log10(|C| / |{d | t ∈ d}|)   and   log10 |{d | t ∈ d}| / log10 |C|

to respectively calculate TF-IDF and DF, where d is the collection of phrases for each app, C is the collection of all d, and f(t,d) denotes the appearance frequency of t in d. This step helps remove trivial phrases (features), and is formalised as follows.

Function gen_exp(app, judge, group, normal, unwanted, m, n)
  Input: the target app, the decision context, and the control parameters m and n
  Output: the explanation of the target app
  salient ← {}
  if judge is malicious then
    salient ← feature(app) ∩ unwanted
  else
    salient ← feature(app) ∩ normal
  end if
  supp ← {}
  corpus ← {}
  for app in group do
    features ← feature(app) ∩ salient
    if features ≠ ∅ then
      doc ← {}
      for feature in features do
        phrase ← feature_to_phrase(feature)
        if phrase not in doc then
          doc[phrase] ← 0
        end if
        doc[phrase] ← doc[phrase] + frequency(feature, app)
      end for
      supp ← supp ∪ {app}
      corpus ← corpus ∪ {(app, doc)}
    end if
  end for
  exp ← sel_df(sel_tfidf(corpus, m), n)
  return judge, exp, supp

The function feature_to_phrase constructs a phrase for a given feature using the templates in Table 2. The functions sel_tfidf and sel_df respectively select phrases for each supporting app and representatives for the whole collection of supporting apps. The function frequency produces the frequency of a feature appearing in an app.

5 Evaluation

In this section, we report a user evaluation of the automatic explanations. We want to show that: (a) explanations produced from semantics-based features are better than those from syntax-based features; (b) explanations with supporting apps are more understandable than those without; (c) explanations produced from context construction are more convincing and preferable than those from greedily extracting features from general classifiers. To test these hypotheses, we design and compare the following methods.

• M-Syntax: applying the context construction to the syntax-based features (permissions and API calls), we produce explanations without supporting apps.

• M-Semantics: applying the context construction to the semantics-based features (happen-befores), we produce explanations without supporting apps.

• M-Context: applying the context construction to the semantics-based features (happen-befores), we produce explanations including supporting apps.

• M-L1LR: using the features with top weights in an L1LR classifier trained on the semantics-based features (happen-befores), we produce explanations including supporting apps.

We applied the above methods to generate explanations for the apps in the testing set described in Section 3.2. The generated explanations were organised into samples. Each sample consists of two explanations for the same app, produced by two different methods. Two example samples are given in Figure 3. We chose three or four samples for each hypothesis test. A survey consisting of 12 samples, covering 10 malware instances and 2 benign apps, was presented to participants.
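The phrase-ranking step from Section 4 can be sketched as follows: rank each supporting app's phrases by TF-IDF, keep the top-m per app, then rank the surviving representatives by document frequency and keep the top-n. The toy corpus below is an illustrative assumption; the TF-IDF formula follows the augmented-TF variant given above (a raw document count stands in for the paper's normalised DF, which preserves the ranking order).

```python
# A sketch of the TF-IDF + DF phrase selection in Section 4.
# The corpus of phrase-frequency dictionaries is toy data.
import math

def tfidf(term, doc, corpus):
    """(0.5 + 0.5*f/max_f) * log10(|C| / |{d : term in d}|)."""
    f = doc.get(term, 0)
    max_f = max(doc.values())
    in_docs = sum(1 for d in corpus if term in d)
    return (0.5 + 0.5 * f / max_f) * math.log10(len(corpus) / in_docs)

def rank_phrases(corpus, m, n):
    """Top-m phrases per document by TF-IDF, then top-n overall by
    document frequency (monotone in the paper's normalised DF)."""
    reps = []
    for doc in corpus:
        top_m = sorted(doc, key=lambda t: tfidf(t, doc, corpus),
                       reverse=True)[:m]
        reps.extend(top_m)
    def df(term):
        return sum(1 for d in corpus if term in d)
    return sorted(set(reps), key=df, reverse=True)[:n]

corpus = [
    {"read phone state then send SMS": 3, "connect to Internet": 9},
    {"read phone state then send SMS": 2, "connect to Internet": 7},
    {"read phone state then send SMS": 4, "show a notification": 1},
]
print(rank_phrases(corpus, m=1, n=2))
```

Note how the phrase appearing in every document gets a TF-IDF of zero and is dropped, which is how this step removes trivial phrases like "access networks state then connect to Internet".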
Participants were invited to read through all the samples and, for each sample, to choose the explanation which they preferred and to give a convince-score between 1 and 5 to each explanation. This score indicates to what extent an explanation convinces the participant. We collected the participants' preferences as well as their convince-scores.

—————————————————————————————
com.android.security (v4.3)

Explanation A (M-Semantics)
This app is malware. Its malicious behaviours are:
  read your phone state then connect to Internet
  connect to Internet then read your phone state
  read your phone state after a phone call is made
  send SMS then read your phone state
  read your phone state then send SMS

Explanation B (M-Syntax)
This app is malware. Its malicious behaviours are:
  request the permission to send SMS
  request the permission to receive SMS
  request the permission to read your phone state
  request the permission to read SMS
  might invoke the API: android.content.Intent.
—————————————————————————————
org.android.system (v1.0)

Explanation A (M-Context)
This app is malware. Its malicious behaviours are:
  read your phone state after a phone call is made
  read your phone state then connect to Internet
  send SMS then read your phone state
  read your phone state then send SMS
  send SMS after a phone call is made
The supporting apps of this explanation are:
  com.android.security (v4.3) (MALWARE)
  org.android.system (v1.0) (MALWARE)
  ...

Explanation B (M-L1LR)
This app is malware. Its malicious behaviours are:
  read your phone state after a phone call is made
The supporting apps of this explanation are:
  com.googleapps.ru (v1.0) (TROJAN)
  com.keji.danti562 (v3.0.8) (MALWARE)
  ...
—————————————————————————————

Figure 3: Example explanations for hypothesis testing.

People from universities, software companies, and finance firms in the UK and China were invited by mailing lists to participate in this survey. No participant knew the mechanism behind the automatic explanations discussed in this paper. We received 20 responses. The respondents include: seven junior and one senior software engineers, seven postgraduate students, one lecturer, three data analysts, and one malware analyst. Three of them declared themselves familiar with Android programming and malware analysis.

We report the user-evaluation results as follows.

Method       Convince-score       Comparison                Preference
             Average    Std.
M-Syntax     3.15       0.85      M-Context / M-Syntax      58% / 42%
M-Semantics  3.03       0.66      M-Context / M-Semantics   78% / 22%
M-Context    3.61       0.80      M-Context / M-L1LR        53% / 47%
M-L1LR       3.32       0.81

It shows that the context construction achieves the highest average convince-score, 3.61, and that most respondents prefer the explanations produced by the context construction. We performed paired T-tests on the three comparisons: M-Context versus M-Syntax, M-Context versus M-Semantics, and M-Context versus M-L1LR. We set the significance level at 0.05, calculated the differences between their convince-scores, and tested the null hypothesis that the average difference is less than or equal to 0. The p-values are 0.02, 0.0002, and 0.05 respectively. That is, all null hypotheses are rejected at significance level 0.05. Automatic explanation by the context construction is better than the alternative methods.

Respondents commented that the explanations revealed some behaviours they had not realised before, e.g., that an app called "com.antivirus.kav" sends SMS after a phone call is made, and that supporting apps improve their understanding of a given explanation, e.g., they are more inclined to believe an app is benign when they see familiar benign app names like Google Talk among the supporting apps. But some of them, especially the malware analyst and the postgraduate students, wanted to see the detailed features we use to produce the explanations. This explains why M-Syntax is slightly better than M-Semantics in this survey: API names are included in the explanations produced by M-Syntax but not in those of M-Semantics. In practice, we can hide detailed features from users and only present them on demand, as evidence.

6 Conclusion and Further Work

We present a new approach to automatically generate explanations of unwanted behaviours of Android apps. It exploits semantics-based features, constructs context-sensitive unwanted behaviours, and produces explanations by aggregating features into phrases.

The context we have constructed is simple and straightforward. As shown in Table 1, the groups are unbalanced: some consist of hundreds of apps and some of several malware instances. In further work, we want to construct more balanced and fine-grained groups, so that supervised learning methods can be applied to obtain well-performing classifiers. By doing so, our approach to generating explanations can be extended to take features from well-trained classifiers as input.

A good classifier might not lead to a good explainer. As shown in Section 5, the explanations produced using the method M-L1LR are not the most preferable, although the L1LR classifier has better classification performance. Evaluating the quality of automatic explanations is difficult. In this paper, we surveyed 20 general users to show the effectiveness of our method. In further work, instead of general users, we want to survey a bigger group of malware analysts, since malware analysts are more suitable readers of these explanations. Also, more complex statistical models, like ANOVA, will be applied to analyse the survey results.

There are still certain types of high-level behaviours that are exhibited in Android malware but cannot be fully captured by our approach, e.g., gaining root access and performing DDoS attacks [ZJ12]. This is because these complex behaviours do not correspond to simple semantics-based features like happen-befores. In further work, a promising approach to removing this limitation might be to exploit more semantics-based features to capture these high-level behaviours.

References

[A+14a] Daniel Arp et al. Drebin: Efficient and explainable detection of Android malware in your pocket. In NDSS, pages 23–26, 2014.

[A+14b] Steven Arzt et al. FlowDroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps. In PLDI, pages 259–269, 2014.

[ADY13] Yousra Aafer, Wenliang Du, and Heng Yin. DroidAPIMiner: Mining API-level features for robust malware detection in Android. In SecureComm, 2013.

[AZHL12] Kathy Wain Yee Au, Yi Fan Zhou, Zhen Huang, and David Lie. PScout: Analyzing the Android permission specification. In CCS, 2012.

[BKvOS10] David Barrera, Hilmi Günes Kayacik, Paul C. van Oorschot, and Anil Somayaji. A methodology for empirical analysis of permission-based security models and its application to Android. In CCS, 2010.

[C+13] Kevin Zhijie Chen et al. Contextual policy enforcement in Android applications with permission event graphs. In NDSS, 2013.

[EOM09] William Enck, Machigar Ongtang, and Patrick Drew McDaniel. On lightweight mobile phone application certification. In CCS, pages 235–245, 2009.

[EOMC11] William Enck, Damien Octeau, Patrick McDaniel, and Swarat Chaudhuri. A study of Android application security. In USENIX Security Symposium, 2011.

[F+08] Rong-En Fan et al. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871–1874, June 2008.

[F+11] Adrienne Porter Felt et al. Android permissions demystified. In CCS, 2011.

[GTGZ14] Alessandra Gorla, Ilaria Tavecchia, Florian Gross, and Andreas Zeller. Checking app behavior against app descriptions. In ICSE, 2014.

[GYAR13] Hugo Gascon, Fabian Yamaguchi, Daniel Arp, and Konrad Rieck. Structural detection of Android malware using embedded call graphs. In AISec, pages 45–54, 2013.

[KB15] Jan-Christoph Kuester and Andreas Bauer. Monitoring real Android malware. In Runtime Verification, 2015.

[S+13] Michael Spreitzenbarth et al. Mobile-Sandbox: Having a deeper look into Android applications. In SAC, 2013.

[Tib94] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.

[WROR14] Fengguo Wei, Sankardas Roy, Xinming Ou, and Robby. Amandroid: A precise and general inter-component data flow analysis framework for security vetting of Android apps. In CCS, 2014.

[Y+14] Chao Yang et al. DroidMiner: Automated mining and characterization of fine-grained malicious behaviors in Android applications. In ESORICS, 2014.

[YSMM13] Suleiman Y. Yerima, Sakir Sezer, Gavin McWilliams, and Igor Muttik. A new Android malware detection approach using Bayesian classification. In AINA, 2013.

[Z+13] Wu Zhou et al. Fast, scalable detection of "piggybacked" mobile applications. In CODASPY, 2013.

[Z+14] Fangfang Zhang et al. ViewDroid: Towards obfuscation-resilient mobile application repackaging detection. In WiSec, 2014.

[ZDFY15] Mu Zhang, Yue Duan, Qian Feng, and Heng Yin. Towards automatic generation of security-centric descriptions for Android apps. In CCS, 2015.

[ZJ12] Yajin Zhou and Xuxian Jiang. Dissecting Android malware: Characterization and evolution. In IEEE Symposium on Security and Privacy, 2012.