Explaining Unwanted Behaviours in Context

Wei Chen (University of Edinburgh, wchen2@inf.ed.ac.uk)
David Aspinall (University of Edinburgh, David.Aspinall@ed.ac.uk)
Andrew D. Gordon (Microsoft Research Cambridge and University of Edinburgh, Andy.Gordon@ed.ac.uk)
Charles Sutton (University of Edinburgh, csutton@inf.ed.ac.uk)
Igor Muttik (Intel Security, igor.muttik@intel.com)

Abstract

Mobile malware has increasingly been identified based on unwanted behaviours, such as sending premium SMS messages. However, behaviours that are unwanted in one group of apps can be normal in another, i.e., they are context-sensitive. We develop an approach to automatically explain unwanted behaviours in context and evaluate the automatic explanations via a user study, with favourable results. These explanations not only state whether an app is malware but also elaborate how, and in what kind of context, a decision was made.

1 Introduction

Researchers and malware analysts have identified hundreds of thousands of mobile apps as malware [EOMC11, ZJ12] and organised them into families based on unwanted behaviours, e.g., stealing personal information, accessing locations, collecting contact information, constantly sending premium messages, etc. However, apart from malware analysis reports on a few famous malware families [ZJ12, S+13], e.g., Geinimi, Basebridge, Spitmo, Zitmo, Ginmaster, Ggtracker, Droidkungfu, etc., people do not know what kind of behaviour makes a mobile app bad. This suggests a research problem: automatically producing a short paragraph that explains unwanted behaviours.

A naive method is to: train a linear classifier on a collection of identified malware instances and benign apps, choose the features with the top weights assigned by this classifier, then process the selected features through templates to output text. This method has been adopted in research, e.g., by the Drebin system [A+14a].

However, by greedily choosing features to output text, the generated explanations are inaccurate. This is mainly because unwanted behaviours of mobile apps are context-sensitive, i.e., a behaviour that is unwanted in one group of apps can be normal in another. For example, collecting locations is normal for jogging-tracker apps, but unwanted for card-game apps.

Instead, our new approach is to organise sample apps into fine-grained groups by their behavioural similarity. We set the context of an app in question to the group whose members' behaviours are most similar to this app's behaviours. By exploiting the behavioural difference between malware and benign apps in this context, we decide whether the target app is malware, and if so, we produce an explanation. Here are two example automatic explanations.

a. This app is like a chatting app, but, after a USB mass storage is connected, it will: retrieve a class in a runnable package; read information about networks; connect to the Internet.

b. This app is like an anti-virus app, but it will: read your phone state after a phone call is made; read your phone state then connect to the Internet; send SMS messages after a phone call is made.

These explanations not only elaborate which behaviour is unwanted but also give the context, e.g., chatting and anti-virus, in which a decision was made. Here, the context name comes from the known category names of the apps in the same group, by majority voting.

Copyright © by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: D. Aspinall, L. Cavallaro, M. N. Seghir, M. Volkamer (eds.): Proceedings of the Workshop on Innovations in Mobile Privacy and Security IMPS at ESSoS'16, London, UK, 06-April-2016, published at http://ceur-ws.org
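The naive baseline criticised above can be sketched in a few lines: fit a linear scorer, take the top-weighted features present in the target app, and render them through text templates. The feature names, templates, toy data, and the crude frequency-difference "classifier" below are illustrative assumptions, not the pipeline of Drebin or of this paper.

```python
# Sketch of the naive, context-free explanation baseline: a linear
# scorer's top-weighted features are templated into text. Feature
# names, templates, and toy data are illustrative assumptions.
FEATURES = ["SEND_SMS", "READ_PHONE_STATE", "ACCESS_FINE_LOCATION"]
TEMPLATES = {
    "SEND_SMS": "it may send SMS messages",
    "READ_PHONE_STATE": "it may read your phone state",
    "ACCESS_FINE_LOCATION": "it may access your precise location",
}

# Toy training set: (binary feature vector, label), 1 = malware.
train = [([1, 1, 0], 1), ([1, 0, 1], 1), ([0, 0, 1], 0), ([0, 1, 0], 0)]

def fit(samples):
    """A crude linear 'classifier': each weight is the frequency
    difference of the feature between malware and benign apps,
    standing in for learned classifier weights."""
    weights = [0.0] * len(FEATURES)
    for vec, label in samples:
        for i, v in enumerate(vec):
            weights[i] += v if label == 1 else -v
    return weights

def explain(weights, app_vector, top_k=2):
    """Greedily template the top-k positively weighted features
    that are present in the target app."""
    present = [i for i, v in enumerate(app_vector) if v and weights[i] > 0]
    present.sort(key=lambda i: weights[i], reverse=True)
    return [TEMPLATES[FEATURES[i]] for i in present[:top_k]]

w = fit(train)
print(explain(w, [1, 1, 0]))  # greedy, context-free explanation
```

Because the weights are global, the same feature produces the same text for every app, which is exactly the context-insensitivity this paper sets out to fix.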
Our approach combines static analysis, clustering, supervised learning, and text-mining techniques, and proceeds as follows.

• Formalisation. We approximate the behaviour of an Android app by an extended call-graph, i.e., a collection of finite control-sequences of events, actions, and annotated API calls. From this graph we extract happen-before features, each denoting that something happens before something else.

• Learning. We organise sample apps into groups using clustering methods, and characterise the unwanted behaviours of each group by exploring the difference between malware instances and benign apps within the same group.

• Explanation. We decide whether a target app is malware by choosing a group and then checking, against this group, whether the app has any unwanted behaviour, i.e., a behaviour exhibited by a malware instance in the group. The corresponding features are fed through hand-built templates to produce text as explanations.

The main contributions of this paper are to:

- show that the happen-before feature is an appropriate abstraction of app behaviours with respect to learning and explaining;

- introduce context into behaviour explanations and develop a clustering-based algorithm (Figure 2) to organise sample apps into groups and construct unwanted behaviours for each group;

- demonstrate, by surveying general users, that automatic explanation in context produces more convincing and desirable results than several other candidate methods.

1.1 Related Work

To automatically detect Android malware, machine learning methods have been applied to train classifiers [ADY13, GYAR13, GTGZ14, YSMM13]. All of them aimed to obtain good fits to the training data by trying different methods and features. Explanations of the chosen features have received much less consideration.

The tool Drebin [A+14a] is the first attempt to automatically generate explanations of Android malware. It generates explanations by choosing features with top weights from a linear classifier, then processing them through hand-built templates to output text. A broad range of syntax-based features, e.g., permissions, API calls, intents, URLs, etc., were collected for training. A recent prototype, DescribeMe [ZDFY15], generates text from data-flows by feeding features through hand-built templates. Its main drawback is scalability: producing data-flows is too expensive for most apps.

Our notion of context is similar to the clusters used in the tool CHABADA [GTGZ14]. This tool detects outliers (abnormal API usage) within clusters of apps by using a one-class SVM (OC-SVM). The clusters were formed from the descriptions of apps using Latent Dirichlet Allocation (LDA). However, for most of our sample apps, which were collected from alternative Android markets, e.g., Wandoujia, Baidu, and Tencent in China, it is hard to obtain descriptions, and these descriptions are often written in different languages.

The extended call-graphs are much more accurate than the manifest information, e.g., permissions and actions, which has often been used as input features for malware detection or mitigation [BKvOS10, EOM09, F+11]. Compared with a simple list of API calls appearing in the code, the extended call-graph can capture more sophisticated behaviours. This is needed in practice, because API calls appearing in the code contain "noise" caused by dead code and libraries [ADY13], and some unwanted behaviours only arise when API methods are called in certain orders [C+13, KB15, Y+14]. On the other hand, call-graphs are less accurate than models which capture data-flows. But it is much easier to generate extended call-graphs for apps en masse using our tool than to generate data-flows using tools like FlowDroid [A+14b] or Amandroid [WROR14]. In particular, people can annotate the API methods of interest to generate compact graphs more efficiently, rather than considering all data-dependences between statements.

2 Characterising App Behaviours

We use a simplified synthetic example to illustrate the characterisation of app behaviours. It is an Android app which constantly sends out the device ID and the phone number by SMS messages in the background when an incoming SMS message is received.

We approximate its behaviour by the graph in Figure 1. It tells us: this app has two entries, respectively specified by the actions MAIN and SMS_RECEIVED; it will collect the device ID and the phone number in a Broadcast Receiver, then send SMS messages out in an AsyncTask; the behaviour of sending SMS messages can also be triggered by an interaction from the user, e.g., clicking a button, touching the screen, long-pressing a picture, etc., which is denoted by the word "click".

[Figure 1: An example extended call-graph, with entry actions MAIN and SMS_RECEIVED, a user event "click", and annotated API calls Receiver:getDeviceId, Receiver:getLine1Number, and AsyncTask:sendTextMessage.]

This graph is a collection of finite control-sequences of actions, events, and annotated API calls, constructed from the bytecode of an Android app. Actions reflect what happens in the environment and what kind of service an app requests, e.g., an incoming message is received, the device finishes booting, the app wants to send an email by using the service supplied by an email client, etc. Events denote interactions from the user, e.g., clicking a picture, pressing a button, scrolling down the screen, etc. Annotated API calls tell us whether the app does anything we are interested in. For instance, getDeviceId, getLine1Number, and sendTextMessage are annotated API calls in the above example.

To construct such a graph directly from the bytecode, we have to model complex real-world features of the Android framework, including: inter-procedural calls, callbacks, component life-cycles, permissions, actions, events, inter-component communications, multiple threads, multiple entries, interfaces, nested classes, and runtime-registered listeners. We do not model registers, fields, assignments, operators, pointer aliases, arrays, or exceptions. The choice of which aspects to model is a trade-off between efficiency and precision.

In our implementation, we use an extension of the permission-governed API methods generated by PScout [AZHL12] as annotations. The Android platform tools aapt and dexdump are respectively used to extract the manifest information and to decompile the bytecode into assembly code, from which we construct the extended call-graph.

Once the extended call-graphs are constructed, we can extract features for the purpose of learning unwanted behaviours. In particular, we extract pairs of edge labels occurring in sequence, i.e., denoting that something happens before something else, so-called happen-befores. In general, one can extract n-tuples, but in practice we found that constructing triples was already too expensive: the average number of triples in a typical extended call-graph is of the order of 10^4.

3 Learning Unwanted Behaviours

A behaviour that is unwanted for one kind of app can be innocuous for another. For example, sending SMS messages is normal for messaging apps, but unwanted for an e-reader app; a user might expect that a weather forecast app accesses his or her locations, but might feel uncomfortable if a messaging app does so. Therefore, to understand and explain unwanted behaviour, we need a notion of context.

3.1 Constructing Context

Unwanted behaviours in general account for only a small part of a malicious app's activities. This is by design: malicious apps seek to hide their bad behaviours, and are often constructed by repackaging benign applications [Z+14, Z+13]. This observation gives us a notion of context: we group together apps, benign or malicious, whose behaviours are mostly the same. Then, within the context, we distinguish unwanted from normal behaviours by exploring features which are mostly associated with malware. This produces a fine-grained, behavioural notion of context that is more discriminating than categories, e.g., GAME, TOOLS, WEATHER, etc., or clusters produced from developer-written textual descriptions [GTGZ14].

We formalise this idea in Figure 2. Sample apps are organised into groups. Apps in the same group share common behaviours, in the sense that their feature vectors are similar. Ideally, repackaged apps will end up in the same group as the original benign apps. In practice, a group might consist of only benign apps or only malware; this depends on the feature used for clustering and its distribution in the sample apps.

Two sets of features are constructed for each group: normal and unwanted. The normal set is the union of all behaviours of the benign apps. The unwanted set consists of the abnormal behaviours of the malware, that is, the relative complement of the normal set in the collection of behaviours of the malware instances.

The rule behind this construction is: a benign app cannot have any unwanted behaviour, and a malware instance must have some unwanted behaviour, whatever its other behaviours are. Every sample app in the same group is required to follow this rule; otherwise, there is a conflict in the group. To resolve a conflict, we split the group into two disjoint subgroups. The above construction is then repeated on the subgroups until all conflicts are resolved.
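The happen-before extraction described in Section 2 can be sketched as follows: given an extended call-graph as a set of labelled edges, collect every pair consisting of an incoming edge label followed by an outgoing edge label at the same node. The graph below loosely models the Figure 1 example; the node names and edge encoding are illustrative assumptions, not the tool's actual representation.

```python
# A sketch of happen-before feature extraction from an extended
# call-graph given as labelled edges (source, label, target).
# Node names and labels are illustrative; the pair-extraction
# idea is the one described in Section 2.
from collections import defaultdict

edges = [
    ("entry", "SMS_RECEIVED", "receiver"),
    ("receiver", "Receiver:getDeviceId", "collected"),
    ("collected", "Receiver:getLine1Number", "ready"),
    ("ready", "AsyncTask:sendTextMessage", "sent"),
]

def happen_befores(edges):
    """Collect pairs of edge labels occurring in sequence: the label
    of an edge into a node followed by the label of an edge out of
    that node."""
    outgoing = defaultdict(list)
    for src, label, dst in edges:
        outgoing[src].append(label)
    pairs = set()
    for src, label, dst in edges:
        for next_label in outgoing[dst]:
            pairs.add((label, next_label))
    return pairs

print(sorted(happen_befores(edges)))
```

On this toy graph the extraction yields pairs such as (SMS_RECEIVED, Receiver:getDeviceId), i.e., "after an SMS is received, the device ID is read" — exactly the kind of ordering information a flat list of API calls would lose.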
Function construct_context(group)
  Input: a group of malware and benign applications
  Output: fine-grained groups with normal and unwanted features
  G ← {group}; P ← {}
  has_conflict ← True
  while has_conflict do
    has_conflict ← False
    for group in G do
      normal, unwanted ← collect_behaviour(group)
      if detect_conflict(group, normal, unwanted) then
        group_a, group_b ← split_group(group)
        G ← (G − {group}) ∪ {group_a, group_b}
        has_conflict ← True
      else
        G ← G − {group}
        P ← P ∪ {(group, normal, unwanted)}
      end if
    end for
  end while
  return P

Function collect_behaviour(group)
  normal ← {}; unwanted ← {}
  for app in group do
    if app is benign then
      normal ← normal ∪ feature(app)
      unwanted ← unwanted − normal
    else
      unwanted ← (unwanted ∪ feature(app)) − normal
    end if
  end for
  return normal, unwanted

Function detect_conflict(group, normal, unwanted)
  for app in group do
    if app is benign and feature(app) ⊈ normal then return True
    if app is malicious and feature(app) ∩ unwanted = ∅ then return True
  end for
  return False

Figure 2: Context and unwanted behaviours.

The process starts with the function construct_context, which is invoked on the whole collection of sample apps. When the algorithm terminates, the following property is satisfied: for each app in a group, if it is malware then feature(app) ∩ unwanted ≠ ∅; if it is benign then feature(app) ⊆ normal.

The function split_group splits a group of apps into two disjoint subgroups. Many implementations are possible. We adopt hierarchical clustering to group apps: the cosine dissimilarity between feature vectors is calculated, and average-linkage is used to compute the distances between clusters.

To illustrate the notion of context, we constructed unwanted behaviours for 400 randomly chosen sample apps using the above method. The ten biggest generated groups are given in Table 1.

Group  Size  %Malware  #Normal  #Unwanted  Top Malware Family     Top Category
0      163   93.25     6825     36813      Geinimi, Fakerun       ENTERTAINMENT, PERSONALIZATION
5      24    100.0     0        2611       Basebridge, Spitmo     COMMUNICATION, MUSIC AND AUDIO
21     23    39.13     3306     734        Plankton, Droidkungfu  WEATHER, PHOTOGRAPHY
7      17    41.18     1466     295        unknown                WEATHER, TRANSPORTATION
19     13    15.38     1396     77         Adrd                   COMMUNICATION, TOOLS
12     10    10.0      2027     39         Adrd                   MUSIC AND AUDIO, NEWS AND MAGAZINES
25     8     0.0       227      441        -                      WEATHER, BOOKS AND REFERENCE
4      7     85.71     497      584        unknown                GAME STRATEGY
15     7     85.71     20       2907       unknown                TRAVEL AND LOCAL, WEATHER
6      5     40.0      764      125        unknown                PRODUCTIVITY, NEWS AND MAGAZINES

Table 1: Statistics of context and unwanted behaviours for 400 sample apps.

3.2 Classification

We want to decide whether an app in question is malware by using the constructed context and unwanted behaviours. The size and the proportion of malware vary greatly across groups, as shown in Table 1. This results in a difficulty: it is hard to train a classifier for each group using classical learning methods, e.g., SVM, naive Bayes, or logistic regression. Therefore, we calculate the distances between the target app and each group. The closest group is chosen as the context. Then, we decide whether the target app is malware by applying the following logic rules:

• Conservatively normal. The target app is classified as benign if it has no unwanted behaviour and all its behaviours are normal, i.e., feature(app) ⊆ normal.

• Aggressively malicious. The target app is classified as malicious if one of its behaviours is unwanted, i.e., feature(app) ∩ unwanted ≠ ∅.

• Neutrally suspicious. If the target app has no unwanted behaviour but some of its behaviours are not normal, we consider its abnormal behaviours, i.e., feature(app) − normal, as suspicious, and label the app as unknown. That is, according to current knowledge we cannot decide whether it is malware. The decision has to be postponed until more sample apps of this group are acquired.

We randomly chose 1,000 apps, half benign and half malicious, as the training set, and an equal number of apps as the testing set. They contain some famous benign apps, e.g., Google Talk, Amazon Kindle, Youtube, Facebook, etc., and some instances of famous malware families, e.g., DroidKungfu, Plankton, Zitmo, etc. These apps spread across around 30 categories, from ARCADE GAME to WEATHER. Many advertisement libraries were also found in these apps, e.g., Admob, Millennial Media, Airpush, etc.

To compare our classification method with general classifiers, we train a classifier using liblinear [F+08], an implementation of L1-regularized logistic regression [Tib94] (abbreviated as L1LR). We apply our method to construct context and collect unwanted behaviours from the happen-befores extracted from the extended call-graphs of the apps in the training set. Further, we apply the logic rules discussed earlier to decide whether a target app is malware, against the unwanted behaviours of a chosen group.

We report the classification performance as follows.

             Edge Labels in Graphs      Happen-Befores
Classifier   Precision    Recall        Precision    Recall
context      83%          88%           80%          92%
L1LR         83%          89%           85%          88%

It shows that, for both kinds of features, the classification performance of our method is only slightly worse than L1LR, with no more than a 5% drop in precision. This is because some apps are labelled as unknown by our method. We could achieve better classification performance by adding syntax-based features, e.g., permissions and API calls, as input features. However, our goal is to develop a classification method whose output yields better explanations. Since happen-befores can capture more sophisticated app behaviours, we prefer to use unwanted behaviours selected from happen-befores for explanation generation.

4 Generating Explanations

In the classification against the context, the features in the intersection between the unwanted behaviours of a context and the behaviours of a target app are responsible for a decision; we call these salient features. For a training app in a decision context, if one of its behaviours is salient, then this app is a supporting app for this decision. In this section, we exploit salient features and their supporting apps to generate an explanation for a target app. It explains how, and in what kind of context, a decision was made. We want to use these automatic explanations to convince people of the system's automatic decision. Here is an example automatic explanation.

—————————————————————————————
com.keji.danti590 (v3.0.8)
This application is malware. Its malicious behaviours are:

  after a USB mass storage is connected,
  it gets the superclass of a class in a runnable package
  it retrieves classes in a runnable package
  it reads information about networks
  it connects to Internet
  it reads your phone state then connects to Internet

The supporting apps of this explanation are:
  com.keji.danti607 (v3.0.8) (TROJAN)
  com.jjdd (v1.3.1) (MALWARE)
  com.keji.danti562 (v3.0.8) (TROJAN)
  com.keji.danti599 (v3.0.8) (TROJAN)
—————————————————————————————

It not only shows the decision (malware or benign) but also elaborates the most unwanted behaviours. A collection of supporting apps is displayed as well.

Before presenting the technical details, let us look at some salient features:

a. (Object:ConnectivityManager.getActiveNetworkInfo, Runnable:URL.openConnection)
b. (Activity:WifiManager.isWifiEnabled, Activity:WebView.loadUrl)
c. (Object:WebView.loadUrl, Runnable:WifiInfo.getMacAddress)
d. (AsyncTask:DefaultHttpClient.execute, Runnable:URL.openConnection)
e. (Object:WebView.loadData, Runnable:TelephonyManager.getDeviceId)
f. (AsyncTask:NotificationManager.notify, Object:LocationManager.getLastKnownLocation)

They are pairs extracted from the extended call-graphs of the apps in question. Some of them are trivial: e.g., the behaviour "access networks state then connect to Internet", supported by the feature (Object:ConnectivityManager.getActiveNetworkInfo, Runnable:URL.openConnection), appears in almost every app. Some of them are similar: e.g., to capture the behaviour "connect to Internet", the features URLConnection.openConnection and DefaultHttpClient.execute are considered repeated features. This redundancy further clutters the final explanation.

Based on these observations, we generate explanations as follows: map the salient features into simple phrases, process the simple phrases through templates to output compound phrases, then select the most representative compound phrases to present.

First, for each permission, action, event, and each API call not governed by any permission, a phrase is assigned to describe its function. These phrases were extracted from their brief documentation on Android Developers. Second, for the permission-governed API calls, we look up their corresponding permissions and use the phrases for those permissions. Third, for pair features, we combine the phrases for their coordinates to form compound phrases. The templates used in explanations are listed in Table 2. This step aggregates features to reduce redundancy.

Feature Type              Template                             Example
permission                request the permission to do sth.    request the permission to change Wi-Fi connectivity state
API call                  might invoke the API: API name       might invoke the API: android.content.Intent.
annotation                do sth.                              read your phone state
action                    sth. happens                         the app has finished booting
event                     the user does sth.                   the user clicks a view and holds
(annotation, annotation)  do sth. then do sth.                 read your phone state then connect to Internet
(annotation, action)      do sth. then sth. happens            read SMS then the app makes a phone call
(action, annotation)      after sth. happens do sth.           after the system has finished booting read your phone state
(event, annotation)       when the user does sth. do sth.      when the user touches the screen get your precise location
(event, action)           when the user does sth. sth. happens when the user performs a gesture the app sends some data to someone elsewhere

Table 2: Templates for the explanation generation.

By using the above method, for each supporting app we get a collection of phrases with their appearance frequencies in this app. We rank the phrases of each supporting app using TF-IDF (term frequency - inverse document frequency) and choose the top-m phrases as representatives. Then, we apply DF (document frequency) to rank the representatives of the supporting apps and choose the top-n phrases to present. We use the formulae

  (0.5 + 0.5 × f(t,d) / max{f(t,d) | t ∈ d}) × log10(|C| / |{d | t ∈ d}|)   and   log10 |{d | t ∈ d}| / log10 |C|

to respectively calculate TF-IDF and DF, where d is the collection of phrases for each app, C is the collection of all d, and f(t,d) denotes the appearance frequency of t in d. This step helps remove trivial phrases (features), and is formalised as follows.

Function gen_exp(app, judge, group, normal, unwanted, m, n)
  Input: the target app, the decision context, and the control parameters m and n
  Output: the explanation of the target app
  salient ← {}
  if judge is malicious then
    salient ← feature(app) ∩ unwanted
  else
    salient ← feature(app) ∩ normal
  end if
  supp ← {}
  corpus ← {}
  for app in group do
    features ← feature(app) ∩ salient
    if features ≠ ∅ then
      doc ← {}
      for feature in features do
        phrase ← feature_to_phrase(feature)
        if phrase not in doc then
          doc[phrase] ← 0
        end if
        doc[phrase] ← doc[phrase] + frequency(feature, app)
      end for
      supp ← supp ∪ {app}
      corpus ← corpus ∪ {(app, doc)}
    end if
  end for
  exp ← sel_df(sel_tfidf(corpus, m), n)
  return judge, exp, supp

The function feature_to_phrase constructs a phrase for a given feature using the templates in Table 2. The functions sel_tfidf and sel_df respectively select phrases for each supporting app and representatives for the whole collection of supporting apps. The function frequency produces the frequency of a feature appearing in an app.

5 Evaluation

In this section, we report a user evaluation of the automatic explanations. We want to show that: (a) explanations produced from semantics-based features are better than those from syntax-based features; (b) explanations with supporting apps are more understandable than those without; (c) explanations produced from context construction are more convincing and preferable than those from greedily extracting features from general classifiers. To test these hypotheses, we design and compare the following methods.

• M-Syntax: applying the context construction to the syntax-based features (permissions and API calls), we produce explanations without supporting apps.

• M-Semantics: applying the context construction to the semantics-based features (happen-befores), we produce explanations without supporting apps.

• M-Context: applying the context construction to the semantics-based features (happen-befores), we produce explanations including supporting apps.

• M-L1LR: using the features with top weights in an L1LR classifier trained on the semantics-based features (happen-befores), we produce explanations including supporting apps.

We applied the above methods to generate explanations for the apps in the testing set described in Section 3.2. The generated explanations were organised into samples. Each sample consists of two explanations for the same app, produced by two different methods. Two example samples are given in Figure 3. We chose three or four samples for each hypothesis test. A survey consisting of 12 samples, covering 10 malware instances and 2 benign apps, was presented to participants.
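The phrase-ranking step from Section 4 can be sketched as follows: rank each supporting app's phrases by TF-IDF, keep the top-m per app, then rank the surviving representatives by document frequency and keep the top-n. The toy corpus below is an illustrative assumption; the TF-IDF formula follows the augmented-TF variant given above (a raw document count stands in for the paper's normalised DF, which preserves the ranking order).

```python
# A sketch of the TF-IDF + DF phrase selection in Section 4.
# The corpus of phrase-frequency dictionaries is toy data.
import math

def tfidf(term, doc, corpus):
    """(0.5 + 0.5*f/max_f) * log10(|C| / |{d : term in d}|)."""
    f = doc.get(term, 0)
    max_f = max(doc.values())
    in_docs = sum(1 for d in corpus if term in d)
    return (0.5 + 0.5 * f / max_f) * math.log10(len(corpus) / in_docs)

def rank_phrases(corpus, m, n):
    """Top-m phrases per document by TF-IDF, then top-n overall by
    document frequency (monotone in the paper's normalised DF)."""
    reps = []
    for doc in corpus:
        top_m = sorted(doc, key=lambda t: tfidf(t, doc, corpus),
                       reverse=True)[:m]
        reps.extend(top_m)
    def df(term):
        return sum(1 for d in corpus if term in d)
    return sorted(set(reps), key=df, reverse=True)[:n]

corpus = [
    {"read phone state then send SMS": 3, "connect to Internet": 9},
    {"read phone state then send SMS": 2, "connect to Internet": 7},
    {"read phone state then send SMS": 4, "show a notification": 1},
]
print(rank_phrases(corpus, m=1, n=2))
```

Note how the phrase appearing in every document gets a TF-IDF of zero and is dropped, which is how this step removes trivial phrases like "access networks state then connect to Internet".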
Participants were invited to read through all the samples and, for each sample, to choose the explanation which they preferred and to give a convince-score between 1 and 5 to each explanation. This score indicates to what extent an explanation convinces the participant. We collected the participants' preferences as well as their convince-scores.

—————————————————————————————
com.android.security (v4.3)

Explanation A (M-Semantics)
This app is malware. Its malicious behaviours are:
  read your phone state then connect to Internet
  connect to Internet then read your phone state
  read your phone state after a phone call is made
  send SMS then read your phone state
  read your phone state then send SMS

Explanation B (M-Syntax)
This app is malware. Its malicious behaviours are:
  request the permission to send SMS
  request the permission to receive SMS
  request the permission to read your phone state
  request the permission to read SMS
  might invoke the API: android.content.Intent.
—————————————————————————————
org.android.system (v1.0)

Explanation A (M-Context)
This app is malware. Its malicious behaviours are:
  read your phone state after a phone call is made
  read your phone state then connect to Internet
  send SMS then read your phone state
  read your phone state then send SMS
  send SMS after a phone call is made
The supporting apps of this explanation are:
  com.android.security (v4.3) (MALWARE)
  org.android.system (v1.0) (MALWARE)
  ...

Explanation B (M-L1LR)
This app is malware. Its malicious behaviours are:
  read your phone state after a phone call is made
The supporting apps of this explanation are:
  com.googleapps.ru (v1.0) (TROJAN)
  com.keji.danti562 (v3.0.8) (MALWARE)
  ...
—————————————————————————————

Figure 3: Example explanations for hypothesis testing.

People from universities, software companies, and finance firms in the UK and China were invited by mailing lists to participate in this survey. No participant knew the mechanism behind the automatic explanations discussed in this paper. We received 20 responses. The respondents include: seven junior and one senior software engineers, seven postgraduate students, one lecturer, three data analysts, and one malware analyst. Three of them declared themselves familiar with Android programming and malware analysis.

We report the user-evaluation results as follows.

Method       Convince-score       Comparison                Preference
             Average    Std.
M-Syntax     3.15       0.85      M-Context / M-Syntax      58% / 42%
M-Semantics  3.03       0.66      M-Context / M-Semantics   78% / 22%
M-Context    3.61       0.80      M-Context / M-L1LR        53% / 47%
M-L1LR       3.32       0.81

It shows that the context construction achieves the highest average convince-score, 3.61, and that most respondents prefer the explanations produced by the context construction. We performed paired T-tests on the three comparisons: M-Context versus M-Syntax, M-Context versus M-Semantics, and M-Context versus M-L1LR. We set the significance level at 0.05, calculated the differences between their convince-scores, and tested the null hypothesis that the average difference is less than or equal to 0. The p-values are 0.02, 0.0002, and 0.05 respectively. That is, all null hypotheses are rejected at significance level 0.05. Automatic explanation by the context construction is better than the alternative methods.

Respondents commented that the explanations revealed some behaviours they had not realised before, e.g., that an app called "com.antivirus.kav" sends SMS after a phone call is made, and that supporting apps improve their understanding of a given explanation, e.g., they are more inclined to believe an app is benign when they see familiar benign app names like Google Talk among the supporting apps. But some of them, especially the malware analyst and the postgraduate students, wanted to see the detailed features we use to produce the explanations. This explains why M-Syntax is slightly better than M-Semantics in this survey: API names are included in the explanations produced by M-Syntax but not in those of M-Semantics. In practice, we can hide detailed features from users and only present them on demand, as evidence.

6 Conclusion and Further Work

We present a new approach to automatically generate explanations of unwanted behaviours of Android apps. It exploits semantics-based features, constructs context-sensitive unwanted behaviours, and produces explanations by aggregating features into phrases.

The context we have constructed is simple and straightforward. As shown in Table 1, the groups are unbalanced: some consist of hundreds of apps and some of several malware instances. In further work, we want to construct more balanced and fine-grained groups, so that supervised learning methods can be applied to obtain well-performing classifiers. By doing so, our approach to generating explanations can be extended to take features from well-trained classifiers as input.

A good classifier might not lead to a good explainer. As shown in Section 5, the explanations produced using the method M-L1LR are not the most preferable, although the L1LR classifier has better classification performance. Evaluating the quality of automatic explanations is difficult. In this paper, we surveyed 20 general users to show the effectiveness of our method. In further work, instead of general users, we want to survey a bigger group of malware analysts, since malware analysts are more suitable readers of these explanations. Also, more complex statistical models, like ANOVA, will be applied to analyse the survey results.

There are still certain types of high-level behaviours that are exhibited in Android malware but cannot be fully captured by our approach, e.g., gaining root access and performing DDoS attacks [ZJ12]. This is because these complex behaviours do not correspond to simple semantics-based features like happen-befores. In further work, a promising approach to removing this limitation might be to exploit more semantics-based features to capture these high-level behaviours.

References

[A+14a] Daniel Arp et al. Drebin: Efficient and explainable detection of Android malware in your pocket. In NDSS, pages 23–26, 2014.

[A+14b] Steven Arzt et al. FlowDroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps. In PLDI, pages 259–269, 2014.

[ADY13] Yousra Aafer, Wenliang Du, and Heng Yin. DroidAPIMiner: Mining API-level features for robust malware detection in Android. In SecureComm, 2013.

[AZHL12] Kathy Wain Yee Au, Yi Fan Zhou, Zhen Huang, and David Lie. PScout: Analyzing the Android permission specification. In CCS, 2012.

[BKvOS10] David Barrera, Hilmi Günes Kayacik, Paul C. van Oorschot, and Anil Somayaji. A methodology for empirical analysis of permission-based security models and its application to Android. In CCS, 2010.

[C+13] Kevin Zhijie Chen et al. Contextual policy enforcement in Android applications with permission event graphs. In NDSS, 2013.

[EOM09] William Enck, Machigar Ongtang, and Patrick Drew McDaniel. On lightweight mobile phone application certification. In CCS, pages 235–245, 2009.

[EOMC11] William Enck, Damien Octeau, Patrick McDaniel, and Swarat Chaudhuri. A study of Android application security. In USENIX Security Symposium, 2011.

[F+08] Rong-En Fan et al. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871–1874, June 2008.

[F+11] Adrienne Porter Felt et al. Android permissions demystified. In CCS, 2011.

[GTGZ14] Alessandra Gorla, Ilaria Tavecchia, Florian Gross, and Andreas Zeller. Checking app behavior against app descriptions. In ICSE, 2014.

[GYAR13] Hugo Gascon, Fabian Yamaguchi, Daniel Arp, and Konrad Rieck. Structural detection of Android malware using embedded call graphs. In AISec, pages 45–54, 2013.

[KB15] Jan-Christoph Kuester and Andreas Bauer. Monitoring real Android malware. In Runtime Verification, 2015.

[S+13] Michael Spreitzenbarth et al. Mobile-Sandbox: Having a deeper look into Android applications. In SAC, 2013.

[Tib94] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.

[WROR14] Fengguo Wei, Sankardas Roy, Xinming Ou, and Robby. Amandroid: A precise and general inter-component data flow analysis framework for security vetting of Android apps. In CCS, 2014.

[Y+14] Chao Yang et al. DroidMiner: Automated mining and characterization of fine-grained malicious behaviors in Android applications. In ESORICS, 2014.

[YSMM13] Suleiman Y. Yerima, Sakir Sezer, Gavin McWilliams, and Igor Muttik. A new Android malware detection approach using Bayesian classification. In AINA, 2013.

[Z+13] Wu Zhou et al. Fast, scalable detection of "piggybacked" mobile applications. In CODASPY, 2013.

[Z+14] Fangfang Zhang et al. ViewDroid: Towards obfuscation-resilient mobile application repackaging detection. In WiSec, 2014.

[ZDFY15] Mu Zhang, Yue Duan, Qian Feng, and Heng Yin. Towards automatic generation of security-centric descriptions for Android apps. In CCS, 2015.

[ZJ12] Yajin Zhou and Xuxian Jiang. Dissecting Android malware: Characterization and evolution. In IEEE Symposium on Security and Privacy, 2012.