=Paper=
{{Paper
|id=Vol-1584/paper19
|storemode=property
|title=A Study of Android Malware Detection Techniques and Machine Learning
|pdfUrl=https://ceur-ws.org/Vol-1584/paper19.pdf
|volume=Vol-1584
|authors=Balaji Baskaran,Anca Ralescu
|dblpUrl=https://dblp.org/rec/conf/maics/BaskaranR16
}}
==A Study of Android Malware Detection Techniques and Machine Learning==
Balaji Baskaran and Anca Ralescu, EECS Department, University of Cincinnati, Cincinnati, OH 45221-0030. baskarbi@mail.uc.edu, anca.ralescu@uc.edu

===Abstract===
Android OS is one of the most widely used mobile operating systems. The number of malicious applications and adware is increasing constantly, on par with the number of mobile devices. A great number of commercial signature-based tools are available on the market which prevent, to an extent, the penetration and distribution of malicious applications. Numerous studies claim that traditional signature-based detection systems work well only up to a certain level, and that malware authors use numerous techniques to evade these tools. Given this state of affairs, there is an increasing need for an alternative, robust malware detection system to complement and rectify the signature-based systems. Substantial recent research has focused on machine learning algorithms that analyze features from malicious applications and use those features to classify and detect unknown malicious applications. This study summarizes the evolution of malware detection techniques based on machine learning algorithms, focused on the Android OS.

===Introduction===
According to a 2014 research study (RiskIQ 2014), malicious applications in the Google Play Store increased by 388% between 2011 and 2013.

As the initial part of our research, we conducted an extensive study in which we analyze the current trends and approaches for detecting malware on Android systems using machine learning techniques. The overall goal of this study is to identify the research so far on Android malware detection using machine learning techniques. With this analysis we can formulate a defense mechanism specifically to counteract the update attack, the most difficult intrusion technique to detect and eliminate.

Update attack: In Android, an update attack is defined as a benign application installed on the system downloading malicious payloads while updating itself, or downloading and installing third-party malicious applications. This type of attack is very hard to detect because the original application is benign: unless we track the previously installed versions and the application after the update, we cannot detect the malicious activity. We aim to give a brief approach for counteracting the update attack along with this survey of recent trends in malware detection.

Based on the current attack trends and analysis of the present literature, (Raveendranath et al. 2014) lists the types of malware as follows:

1. Information Extraction: compromises the device and steals personal information such as the IMEI number, the user's personal information, etc.
2. Automatic Calls and SMS: the user's phone bill is increased by making calls and sending SMS to premium numbers.
3. Root Exploits: the malware gains system root privileges, takes control of the system and modifies information.
4. Search Engine Optimization: artificially searches for a term and simulates clicks on targeted websites in order to increase the revenue of a search engine or the traffic on a website.
5. Dynamically Downloaded Code: an installed benign application downloads malicious code and deploys it on the mobile device.
6. Covert Channel: a vulnerability in the device that facilitates information leaks between processes that are not supposed to share information.
7. Botnets: a network of compromised mobile devices with a botmaster controlled by Command and Control (C&C) servers; they carry out spam delivery and DDoS attacks on the host devices.

From this point on, the structure of the paper is as follows. The next section gives a general overview of the current security deployed by the Play Store. The classification of the various methods used to detect malware on Android systems is presented in the section on Android malware detection, and the paper ends with a conclusion.
===Overview of Android System Security===
The Google Play Store uses an in-house malicious application detection system called Bouncer. However, researchers have shown that Bouncer's ability to detect malicious applications is minimal, and they were able to successfully publish a prototype malicious application in the Play Store. The Play Store also uses an application's meta-data, such as user ratings and user comments, to flag a malicious application, but by the time a malicious application is detected it may already have done enough damage to the affected mobile systems.

Malware authors use many techniques to evade detection, such as (a) code obfuscation, (b) encryption, (c) including permissions which are not needed by the application, (d) requesting unneeded hardware, and (e) the download or update attack, in which a benign application updates itself or another application with a malicious payload, which is very hard to detect. This motivates the need for new research on detection techniques, including machine learning based techniques. Many studies have shown that machine learning algorithms can detect malicious activities with very high accuracy.

===Android Malware Detection===
Based on the features used to classify an application, we can categorize the analysis as static or dynamic. Static analysis is done without running the application; examples of static features include (a) permissions, which can be extracted from the AndroidManifest.xml file, and (b) API calls. Dynamic analysis deals with features that are extracted while the application is running, including (a) network traffic, (b) battery usage, (c) IP addresses, etc. The third type of analysis is hybrid analysis, which combines features from the static and dynamic techniques. The rest of this section describes the features extracted from the applications and the machine learning algorithms used.

====Static Analysis====
In static analysis, the features are extracted from the application file without executing the application. This methodology is resource and time efficient, as the application is not executed. At the same time, it suffers from the code obfuscation techniques that malware authors employ to evade static detection. One very popular evasion technique is the update attack: a benign application is installed on the mobile device and, when the application gets an update, the malicious content is downloaded and installed as part of the update. This cannot be detected by static analysis techniques, which scan only the benign application.

The most commonly used static features are permissions and API calls. Since these are extracted from the application's AndroidManifest.xml and classes.dex and influence the detection rate to a high extent, extensive research has used them as features, both alone and combined with other features extracted from the meta-data available in the Google Play Store, such as version name, version number, author's name, last updated time, etc.
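As an illustration of this kind of feature extraction, the sketch below collects the requested permissions from a decoded AndroidManifest.xml; the decoding step (for example with apktool), the file path and the helper name are assumptions made for the example rather than details from the paper.

<syntaxhighlight lang="python">
# Minimal sketch: collect the permissions an app requests from a decoded
# AndroidManifest.xml (e.g., produced by apktool). Path and helper name are
# illustrative; the paper itself does not prescribe this tooling.
import xml.etree.ElementTree as ET

ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def requested_permissions(manifest_path):
    """Return the set of permission names declared in <uses-permission> tags."""
    root = ET.parse(manifest_path).getroot()
    return {
        elem.get(ANDROID_NS + "name")
        for elem in root.iter("uses-permission")
        if elem.get(ANDROID_NS + "name")
    }

if __name__ == "__main__":
    perms = requested_permissions("decoded_apk/AndroidManifest.xml")
    print(sorted(perms))
</syntaxhighlight>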
Dy- 1000 benign applications. They extracted features like Per- namic analysis deals with features that were extracted from missions, API calls, Native Linux System commands and the application while running, including (a) network traffic, various features from manifest and class files. The mal- (b) battery usage, (c) IP address, etc. The third type of anal- ware authors embed native linux commads such as chnown, ysis is hybrid analysis which combines the features from mount, remount, etc., and run them in the Android system static and dynamic techniques. The rest of this section de- when the application is launched. Mutual information (en- scribes the features extracted from the application and ma- tropy) is used to rank the features and then a Bayesian Clas- chine learning algorithm used. sifier is used for classification. (Samra et al.(2013)Samra, Yim, and Ghanem) extracted Static Analysis features from AndroidManifest.xml such as count of xml el- In static analysis, the features are extracted from the appli- ements, application specific information such as name, cat- cation file without executing the application. This method- egory, description, rating, package info, description, rating ology is resource and time efficient as the application is not values, rating counts and price. The information from 18174 executed. But at the same time, this analysis suffers from android application with 4612 business category and 13535 code obfuscation techniques the Malware authors employ to tools were extracted by using web crawlers. They were clus- evade from static detection techniques. One of very popular tered using K-Means clustering. evasion technique is the Update Attack: a benign applica- (Peiravian and Zhu(2013)) utilized permission, API calls tion is installed on the mobile device and when the appli- and the combination of both as features. The two types of cation gets an update, the malicious content is downloaded permissions in Android, requested permission and required and installed as part of the update. This cannot be detected permission are used to express an application as a binary by static analysis techniques which will scan only the benign vector where Pi = 1 iff the Manifest.xml has the ith per- application. mission. Same as permission, API calls are also expressed The most commonly used static features are the Permis- as a binary vector with AP Ii = 1 iff there is the API call sion and API calls. Since these are extracted from the appli- made in the application. These two features are concatenated cation AndroidManifest.xml and influence the malware de- and the third feature is formed. A total of 2510 samples in- tection rate to a high extent, extensive research has been cluding 1260 are malicious and 1250 benign are used. The made with these as features as well as combined with other authors concluded that Bagging, an ensemble classification features extracted from meta-data available in Google Play- method has the best performance in classifying all created Store such as version name, version no., author’s name, last datasets. updated time, etc., (Liu(2013)) investigated three specific types of malware: (Sahs and Khan(2012)) used permissions and Control SMS-related, control-related and spy-related. An applica- Flow Graphs(CFG) as features and used One-class Support tion’s permission and ¡uses-feature¿ xml tag which requests Vector Machine(SVM). 
(Shabtai, Fledel, and Elovici 2010) used permissions, framework methods and framework classes for their classification system.

(Sanz et al. 2012) extracted the strings in the application, the permissions, the user rating, the number of ratings and the size of the application, and used Bayesian networks, the J48 decision tree, Random Forest, and SVM trained with SMO. A total of 820 samples were used for testing, and the authors concluded that they could achieve very high accuracy with a low false positive rate.

(Ghorbanzadeh et al. 2013) used neural networks to detect an application's category from its permissions by means of a multi-layered feed-forward network. The network is built with two layers, each containing 10 neurons; the hidden layer uses a sigmoid transfer function and the output layer a linear transfer function. The authors assumed that the permissions declared in the manifest file may be manipulated by malware authors, who may misrepresent the categories declared in the manifest. To simulate this property, the authors permuted the permissions of 50% of the test data before feeding it into the network.

(Yerima et al. 2013) used 2000 applications, 1000 malicious and 1000 benign. They extracted features such as permissions, API calls, native Linux system commands and various features from the manifest and class files. Malware authors embed native Linux commands such as chown, mount, remount, etc., and run them in the Android system when the application is launched. Mutual information (entropy) is used to rank the features, and a Bayesian classifier is then used for classification.
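The general pattern of ranking binary features by mutual information and then training a Bayesian classifier on the top-ranked ones could look like the following sketch; the synthetic data, the number of retained features and the choice of BernoulliNB are assumptions, not the exact setup of Yerima et al.

<syntaxhighlight lang="python">
# Sketch: rank binary features (permissions, API calls, commands) by mutual
# information with the class label, keep the top-k, and train a Bayesian
# classifier. Data, k and the classifier choice are illustrative only.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(2000, 300))        # stand-in binary feature matrix
y = rng.integers(0, 2, size=2000)               # 0 = benign, 1 = malicious

mi = mutual_info_classif(X, y, discrete_features=True, random_state=1)
top_k = np.argsort(mi)[::-1][:20]               # indices of the 20 most informative features

clf = BernoulliNB().fit(X[:, top_k], y)
print("training accuracy on selected features:", clf.score(X[:, top_k], y))
</syntaxhighlight>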
The false commands are also extracted from the sample applications. negatives were found out to be adwares and they were also A list of top 20 permissions and top 25 API calls used by considered a threat by the tool. benign and malicious applications are presented. (Pehlivan et al.(2014)Pehlivan, Baltaci, Acarturk, and (Fazeen and Dantu(2014)) used combines Intentions of Baykal) used 3748 application packages, developed C# the applications esp., Task Intentions with permission as fea- scripts to automatically extract about 182 attributes that in- ture in developing their model. At first the requested per- clude Permissions, version no and version name of the appli- missions are extracted and a histogram is constructed for cations. The study compared feature selection methods such that task-intention category. Normalizing this results in an as Gain Ratio Attribute Evaluator, Relief Attribute Evalua- I shaped PMF. This shape is used to compare and detect the tor, Control Flow Subset Evaluator, and Consistency Subset unknown applications as benign or malicious based on their Evaluator and machine learning algorithms Bayesian clas- Task Intentions. The system works as follows: sification, Classification and Regression Tree (CART), J48 • Phase I trains and uses machine learning algorithms to DT, RF, SMO. Using the feature selection methods, they find the task intentions of the sample applications. came up with 97 features that could represent the whole dataset. Finally the authors conclude that, with just 25 fea- • Phase II uses the knowledge from Phase I to find the task tures, the Control Flow Subset Evaluator selection gave a intention of an unknown application and classify as be- good performance and Random forest and J48 performed nign or malicious. The I shape is compared with the re- better than Bayesian classifier. quested permission by using a using a matching ratio, that (Chan and Song(2014)) analyzed 796 benign and 175 ma- is generated by a machine learning algorithm. If the ratio licious applications for their study. Permissions used from is in a threshold, then the application is potentially safe. the manifest.xml file and API call info from the classes.dex The authors used Naive Bayes, Multi Layered Perceptron file are extracted and with Information Gain they selected and Random Forests and compared their performances. a set of 19 relevant API calls. They compared the results (Xiaoyan et al.(2014)Xiaoyan, Juan, and Xiujuan) ex- obtained by machine learning algorithms such as Naive tracted permissions from the manifest and represented as a Bayes, SVM with SMO algorithm, RBF Network, Multi binary vector. Then Principle Component Analysis (PCA) Layer Perceptron, Liblinear, J48 decision tree and Random is performed to select the best features. A linear SVM is Forests.The authors concluded that the were able to get 90% trained to classify the app samples. The author compares the of the accuracy by using the API calls and permission com- result with other classifiers such as J48 Decision Tree, Naive bined than using the individual features alone. Bayes, BayesNet, CART, RandomForest and concludes that (Liu and Liu(2014)) combined the two types of permis- SVM gives a better performance. 
(Liu 2013) investigated three specific types of malware: SMS-related, control-related and spy-related. An application's permissions and its <uses-feature> XML tags, which request the hardware needed to run the application, are extracted and used as features. Information Gain is used to select important features, and an SVM with a basic classifier is used to detect malicious applications. The authors could detect spy-related malicious applications with an accuracy of 81%, SMS-related malicious applications with an accuracy of 97%, control-related malicious applications with 100% accuracy, and benign applications with an accuracy of 88%.

(Glodek and Harang 2013) constructed five Random Forests with 5-fold cross validation and compared their performance in detecting malicious applications. They used 500 malicious and 500 benign applications from North Carolina State University's malware project. Permissions, broadcast receivers and native code embedded in the application are used as features, and they concluded that their method outperforms many commercial anti-virus detection tools.

(Jerome et al. 2014) extracted the opcodes from the classes.dex file and translated them into opcode sequences, binary sequences of k-grams that characterize the smallest functionalities required by a program. They trained their model with the Genome Project dataset and 1,246 applications randomly picked from the Google Play Store; the test dataset consists of 25,476 malware samples and 15,670 benign applications from VirusTotal. Information Gain was used to select important features among the available ones, and a linear implementation of SVM was used to classify the application samples. The results were compared with the detection rates of 25 anti-virus tools. The study reports interesting signature patterns of malware, goodware, and the false positives and false negatives of their classifier; the false negatives turned out to be adware, which the tools also considered a threat.
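A small sketch of the opcode k-gram representation is given below; the opcode stream, the value of k and the vocabulary are placeholders, and a real pipeline would disassemble classes.dex and keep only the k-grams retained by Information Gain.

<syntaxhighlight lang="python">
# Sketch: turn a sequence of Dalvik opcodes into binary k-gram presence
# features, the kind of opcode-sequence representation described above.
def kgrams(opcodes, k=3):
    """Set of k-grams (tuples of k consecutive opcodes) in the stream."""
    return {tuple(opcodes[i:i + k]) for i in range(len(opcodes) - k + 1)}

def kgram_vector(opcodes, vocabulary, k=3):
    """Binary vector: 1 iff the vocabulary k-gram occurs in the app's opcodes."""
    present = kgrams(opcodes, k)
    return [1 if g in present else 0 for g in vocabulary]

sample_stream = ["const-string", "invoke-virtual", "move-result-object",
                 "invoke-virtual", "return-void"]
vocab = sorted(kgrams(sample_stream, 3))           # in practice: k-grams kept by feature selection
print(kgram_vector(sample_stream, vocab, 3))       # all 1s against its own vocabulary
</syntaxhighlight>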
(Pehlivan et al. 2014) used 3,748 application packages and developed C# scripts to automatically extract about 182 attributes, including the permissions and the version number and version name of the applications. The study compared feature selection methods such as the Gain Ratio Attribute Evaluator, Relief Attribute Evaluator, Control Flow Subset Evaluator and Consistency Subset Evaluator, and the machine learning algorithms Bayesian classification, Classification and Regression Tree (CART), J48 decision tree, Random Forest and SMO. Using the feature selection methods, they came up with 97 features that could represent the whole dataset. The authors conclude that, with just 25 features, the Control Flow Subset Evaluator gave good performance, and that Random Forest and J48 performed better than the Bayesian classifier.

(Chan and Song 2014) analyzed 796 benign and 175 malicious applications. Permissions from the manifest.xml file and API call information from the classes.dex file are extracted, and with Information Gain they selected a set of 19 relevant API calls. They compared the results obtained by machine learning algorithms such as Naive Bayes, SVM with the SMO algorithm, RBF Network, Multi-Layer Perceptron, Liblinear, the J48 decision tree and Random Forests. The authors concluded that they were able to reach 90% accuracy by using the API calls and permissions combined, better than using the individual features alone.

(Liu and Liu 2014) combined the two types of permissions, required permissions and requested permissions, designed a two-layer approach with these features, and employed machine learning algorithms to detect malicious applications. A total of 28,548 benign applications and 1,536 malicious applications, along with permission pairs, i.e., combinations of any two requested permissions, are analyzed. The two-layered approach helped to balance the detection accuracy and the detection speed of the classifier. In Phase 1, requested permissions and the J48 decision tree algorithm are used for detection; in Phase 2, requested permission pairs and the J48 decision tree are used. If there is any contradiction between the results obtained from the two phases, permission pairs and J48 are used to classify again. The authors achieved good results with this approach and recommended using permissions at the component level rather than the application level for better detection of malicious activities.
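A compact sketch of this two-phase idea follows: one decision tree sees single requested permissions, a second sees permission pairs, and the pair-based tree decides when the two disagree; the data and the tie-breaking rule are illustrative rather than the authors' implementation.

<syntaxhighlight lang="python">
# Sketch of a two-phase permission / permission-pair classifier: phase 1 uses
# single-permission indicators, phase 2 uses permission-pair indicators, and
# the pair model decides when the two phases contradict each other.
from itertools import combinations
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def pair_features(X):
    """Expand binary permission vectors into binary permission-pair vectors."""
    idx_pairs = list(combinations(range(X.shape[1]), 2))
    return np.array([[row[i] & row[j] for i, j in idx_pairs] for row in X])

rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(500, 15))      # stand-in requested-permission vectors
y = rng.integers(0, 2, size=500)

phase1 = DecisionTreeClassifier(random_state=3).fit(X, y)
phase2 = DecisionTreeClassifier(random_state=3).fit(pair_features(X), y)

def classify(sample):
    p1 = phase1.predict(sample.reshape(1, -1))[0]
    p2 = phase2.predict(pair_features(sample.reshape(1, -1)))[0]
    return p2 if p1 != p2 else p1           # on contradiction, trust the pair-based phase

print(classify(X[0]))
</syntaxhighlight>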
(Ideses and Neuberger 2014) used permissions, broadcast receivers and activities, byte code fragments and system calls as features, and trained an SVM on the training dataset. The researchers benchmarked their proposed malware detection system with a security tester, where the system was tested on 7,000 samples. They conclude that their system achieves about a 99.3% positive rate with just a 0.14% false alarm rate.

(Yerima, Sezer, and McWilliams 2014a) presented and analyzed three Bayesian classification approaches for detecting Android malware. Permissions and code-based properties such as API calls (both Java system based and Android system based) as well as Linux and Android system commands are extracted from the sample applications. A list of the top 20 permissions and top 25 API calls used by benign and malicious applications is presented.

(Fazeen and Dantu 2014) combined the intentions of the applications, especially task intentions, with permissions as features in developing their model. First, the requested permissions are extracted and a histogram is constructed for each task-intention category; normalizing the histogram results in an I-shaped PMF. This shape is used to compare unknown applications and label them benign or malicious based on their task intentions. The system works as follows:
* Phase I trains and uses machine learning algorithms to find the task intentions of the sample applications.
* Phase II uses the knowledge from Phase I to find the task intention of an unknown application and classify it as benign or malicious. The I shape is compared with the requested permissions using a matching ratio generated by a machine learning algorithm; if the ratio lies within a threshold, the application is considered potentially safe. The authors used Naive Bayes, Multi-Layer Perceptron and Random Forests and compared their performances.

(Xiaoyan, Juan, and Xiujuan 2014) extracted permissions from the manifest and represented them as a binary vector. Principal Component Analysis (PCA) is then performed to select the best features, and a linear SVM is trained to classify the app samples. The authors compare the result with other classifiers such as the J48 decision tree, Naive Bayes, BayesNet, CART and Random Forest, and conclude that the SVM gives better performance.
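The PCA-plus-linear-SVM pipeline described here can be sketched as follows, assuming scikit-learn; the synthetic permission vectors, the number of components and the train/test split are illustrative only.

<syntaxhighlight lang="python">
# Sketch: dimensionality reduction with PCA followed by a linear SVM on binary
# permission vectors. Data, component count and split are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(1000, 120)).astype(float)  # stand-in permission vectors
y = rng.integers(0, 2, size=1000)                        # 0 = benign, 1 = malicious

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

model = make_pipeline(PCA(n_components=20), LinearSVC(C=1.0, max_iter=5000))
model.fit(X_tr, y_tr)
print("held-out accuracy:", model.score(X_te, y_te))
</syntaxhighlight>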
(Yerima, Sezer, and Muttik 2014b) came up with a parallel implementation of their system for detecting malicious Android applications. They used application-related features such as permissions and standard OS and Android framework commands, and developed parallel implementations of a logistic-function-based classifier, Naive Bayes (a probabilistic method), and PART and RIDOR, which are rule-based classifiers. With the extracted features, classification is performed with the individual algorithms and the parallel combination is then carried out; the maximum probability scheme achieved an accuracy of 97.5%.

(Idrees and Rajarajan 2014) combine permissions and intents, and used 292 applications for training and 340 for testing their model. The study describes some usage statistics of benign and malicious applications with regard to intents and permissions, and uses Naive Bayes, KStar and Prism to separate the malicious applications from the benign ones.

(Munoz et al. 2015) collected information from Google Play meta-data, such as intrinsic application features, application category, developer-related features, certificate-related features and social-related features. They concluded that certificate and developer information together with the intrinsic application features are the most promising features for identifying malware from meta-data alone.

(Westyarian, Rosmansyah, and Dabarsyah 2015) used 205 benign and 207 malicious application files and extracted only the API calls related to the permissions declared under the <uses-permission> label in the manifest.xml file. The study concluded that 97% of the malware requests TelephonyManager and ConnectivityManager, making these the most important features. Random Forest classification obtains 92.4% accuracy with cross validation, and SVM obtains 91.4% with a percentage split.

(Chuang and Wang 2015) collected API calls from benign applications and from malicious applications separately and used these as features for classifying unknown samples. The APIs in the unknown sample are ranked according to the difference between their number of occurrences in benign and in malicious applications. A single-model approach combines the two feature sets into one vector; in the malicious-model approach, only the hypothesis from the malicious-tending APIs is used for classification; and the hybrid approach combines two separately trained SVM models, whose results are compared to predict whether the unknown sample is malicious. The hybrid model behaved much better than the malicious model, but the single model obtained from the combined features outperformed the malicious model.

Table 1 lists the most frequently used features in static analysis, and Table 2 summarizes the top features that are combined with other features to produce better detection rates. From Tables 1 and 2 it can be clearly seen that permissions and API calls, the two features extracted from the manifest file and the .dex file, produce higher detection rates; to make them more fail-safe, they can be combined with other features such as the meta-data collected from the Google Play Store or features extracted from the XML elements.

{| class="wikitable"
|+ Table 1: Topmost used features in static analysis
! Sl. No. !! Feature
|-
| 1 || Permissions
|-
| 2 || API calls
|-
| 3 || Strings extracted
|-
| 4 || Native commands
|-
| 5 || XML elements
|-
| 6 || Meta-data
|-
| 7 || Opcodes from the .dex file
|-
| 8 || Task intents
|}

{| class="wikitable"
|+ Table 2: Top features combined with other features in static analysis
! Feature !! Combined with
|-
| Permissions || Broadcast receivers; uses-feature tag; Android OS commands; API calls; meta-data; opcodes
|-
| API calls || Features extracted from manifest files and class files
|}
====Dynamic Analysis====
(Wei et al. 2012) used DroidBox, a tool for monitoring an application in real time, to dynamically analyze the behavior of Android applications. The IP address of the source is extracted from the network traffic after the application is run in a sandbox environment. The research concentrated only on the network characteristics of the malware, leveraging the fact that it will soon look for its next target. The extracted IP addresses are used to find the spatial location through external services and to determine the uniformity of the geographic distribution of the hosts, because infected hosts will be distributed worldwide. After extracting the features, an M x N APP-GEO matrix is constructed, with M representing the Android applications (rows) and N the network features, and Independent Component Analysis (ICA) is used to extract the latent concepts from the noisy spamming data. The researchers used Weka and FastICA, two open source libraries, to evaluate their model. A total of 310 malware samples were used, and they achieved an accuracy rate of about 93%.

(Ham and Choi 2013) used 30 normal apps and 5 malware samples (GoldDream, PJApps, DroidKungFu2, Snake and Angry Birds Rio Unlocker) in their study. The resources allocated when the app starts are monitored and the behavioral pattern is extracted; these resource data are stored on the device and converted into feature vectors. The features are subdivided into 7 categories: network, SMS, CPU, power usage, process (e.g., ID, name, running processes), memory (native, Dalvik and other) and virtual memory. 32 features are related to malware detection, and Information Gain is applied to select among them. They used Naive Bayes, Random Forest, Logistic Regression (LR) and SVM with 10-fold cross validation. The authors concluded that the confusion matrices of Naive Bayes and LR are irregular in distribution with these features; SVM classified normal data correctly almost 100% of the time but falsely labeled malicious applications as benign; and Random Forest outperformed all the other algorithms, correctly classifying the majority of normal and malware applications.

(Lu et al. 2013) compared the Bayesian method alone against the Bayesian method combined with Chi-Square feature selection. The study concluded that the Bayesian method with Chi-Square yielded an accuracy of 89%, while the Bayesian method alone yielded 80%.

(Tenenboim-Chekina et al. 2013) used 5 to 10 self-written Trojan malware samples, each with two versions: one benign and one malicious, the latter being a repackaged version of the benign app with malicious code. While the applications are running, many network-based features are extracted. The self-written applications are installed on the devices and their behavior is collected and analyzed; this makes the traffic patterns of benign and malicious versions distinguishable. Feature measurements are performed at fixed time intervals and aggregation functions are then computed over these measurements. Cross-feature analysis is used to explore the correlation between features, and the deviations of abnormal activities from normal activities are observed; with labeled samples, a deviation threshold is obtained during the algorithm formulation. The study could successfully detect the repackaged malicious applications using the learned network features.

(Alam and Vuong 2013) rooted the mobile device to obtain details such as the data being sent by applications, the IP addresses being contacted, the number of active communications and the system calls, and used a Random Forest with 1,330 malicious and 407 benign applications. The authors concluded that with more trees and fewer features per tree in the Random Forest they could achieve an accuracy of 99%.

(Mas'ud et al. 2014) monitored the system calls of 30 normal applications and 30 malicious applications. The study compares 5 feature selection methods and 5 machine learning classifiers: KNN, decision tree, Multi-Layer Perceptron (MLP), Random Forest and Naive Bayes. The applications are run on real devices and monitored for the system calls generated, using strace, a tool that logs various system activities on Android systems. The features are then selected by Information Gain and Chi-Square, and a set of 5 feature sets is devised and used to compare the efficiency of the 5 machine learning algorithms. The study concluded that the MLP achieves the highest accuracy and true positive rate for one feature set, while the J48 decision tree achieves a high performance rate for another feature set.

(Ng and Hwang 2014) also used strace, monitoring each application for 60 seconds. The features taken into account were the strace-logged process ID, the system calls, the returned values and the times between consecutive system calls; the number of times each call is invoked is counted. PCA is used to select the important features, and the classifier then labels the application sample as malicious or benign based on the anomaly score obtained from the input. The authors compared their system's performance with classifiers such as Naive Bayes, the J48 decision tree and SVM, and claim that they could achieve a 98.4% detection rate.
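The sketch below shows how an strace log of this kind could be reduced to per-application system-call count vectors, the representation these system-call based studies start from; the regular expression, the sample log lines and the vocabulary are assumptions, and a real setup would capture the trace on a device or emulator.

<syntaxhighlight lang="python">
# Sketch: reduce an strace log to a per-application system-call count vector,
# the kind of dynamic feature used by the system-call based studies above.
import re
from collections import Counter

SYSCALL_RE = re.compile(r"^\s*(?:\d+\s+)?([a-z_0-9]+)\(")   # "1234  read(3, ..." -> "read"

def syscall_counts(trace_lines):
    counts = Counter()
    for line in trace_lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

def count_vector(counts, vocabulary):
    return [counts.get(name, 0) for name in vocabulary]

sample_trace = [
    '1234  read(3, "...", 4096) = 512',
    '1234  write(4, "...", 128) = 128',
    '1234  read(3, "...", 4096) = 0',
]
vocab = ["read", "write", "open", "sendto", "recvfrom"]
print(count_vector(syscall_counts(sample_trace), vocab))   # e.g. [2, 1, 0, 0, 0]
</syntaxhighlight>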
(Kim and Choi 2014) extracted Linux-based features from the Android OS and used them to detect malicious applications. 59 features were obtained, covering memory, CPU, network, etc. 6 malware samples were run on the system, and the system was monitored to collect these features; every 10 seconds the data is collected and sent to a server, which performs the classification. Out of the 59 features, 36 are selected, and the results are compared before and after applying feature selection. The study reports that feature selection improves the accuracy and reduces the false positive rate of the classification.

(Kurniawan, Rosmansyah, and Dabarsyah 2015) used Logger, a default application built into Android, to extract the total Internet traffic, the percentage of battery used and the battery temperature every minute. This information is collected as a set of features and fed into Weka, an open source machine learning library, for training and testing with Naive Bayes, the J48 decision tree and Random Forest. The authors concluded that Random Forest has a high accuracy of 85.6% with these features, and propose other features that could be combined with the existing system to improve the accuracy.

Table 3 summarizes the most frequently used features in dynamic analysis. As can be seen, network traffic, which includes the data packets sent, and other behavioral patterns can lead to quick detection of malicious activity. Tracing the IP address can help map the geographical landscape of the attack surface. Besides these, SMS and the information logged by Logger and strace are very helpful in achieving a higher detection rate.

{| class="wikitable"
|+ Table 3: Top features used in dynamic analysis
! Sl. No. !! Feature !! Machine learning algorithm
|-
| 1 || Network, SMS, power usage, CPU, process info, native and Dalvik memory || Naive Bayes, Random Forest, SVM with SMO algorithm
|-
| 2 || Data packets being sent, IP address, no. of active communications, system calls || Random Forest
|-
| 3 || Process ID, system calls collected by strace, returned values, times between consecutive calls || Naive Bayes, decision trees, SVM
|-
| 4 || Network traffic: destination IP address || Classification
|-
| 5 || System calls collected by strace, logs of system activities || J48 decision trees, KNN, ST, Multi-Layer Perceptron
|-
| 6 || Data collected by Logger: Internet traffic, battery percentage, temperature collected every minute || Naive Bayes, J48 decision trees
|}

====Hybrid Analysis====
The hybrid methodology combines static and dynamic features, collected respectively from analyzing the application file and from extracting information while the application is running. Though it can increase the detection accuracy, it makes the system cumbersome and the analysis process time consuming.

(Shabtai 2010) extracted opcodes from the executable and proposed a framework that monitors the device state at every instant, e.g., CPU usage, number of packets sent over the network, number of running processes and battery level; the applications are downloaded from the Play Store. The author examines the applicability of Knowledge Based Temporal Abstraction (KBTA), which helps to continuously monitor and measure events on a mobile system. The study concluded with a 94% detection rate and showed the feasibility of running such a system with just 3% power consumption. The author also recommends the use of SELinux to enhance the security mechanisms of Android. The efficiency of machine learning algorithms such as decision trees, Naive Bayes, BayesNet, K-Means, histograms and Logistic Regression is compared and evaluated.

(Xu et al. 2013) propose MobSafe, a system that combines dynamic analysis (the Android Security Evaluation Framework, ASEF) and static analysis (the Static Android Analysis Framework, SAAF). They used 100,000 active Android applications from AppChina. The static features include information from the apk files; the decoded smali files are analyzed to extract permissions and heuristic patterns, and program slicing is applied to functions of interest. This static stage involves no machine learning and completes within 2 minutes. For dynamic analysis, ADB logging and tcpdump were used: the application is launched on a virtual machine and subjected to simulated human-level interaction. The results are then compared with a CVE library, and the app's Internet activity is checked against the Google Safe Browsing API to determine whether the URLs the app requested are malicious.

(Wei et al. 2013) analyzed 96 benign applications and 92 malware samples. Static features such as software profiles are extracted, and strace is used to record the system calls along with the process ID while the application is running, as dynamic features. This information is collected and fed to Support Vector Machines and Naive Bayes.

(Feldman, Stadther, and Wang 2014) propose Manilyzer, a system which uses requested permissions, high-priority receivers, low version numbers and abused services as features, and test their model with 617 applications: 307 malicious and 310 benign. The efficiency of Naive Bayes, SVM, K-Nearest Neighbours and J48 decision trees is compared. They conclude that most of the malware was labelled with a 1.x application version number, and also that high-priority intent filters were closely associated with SMS malware, as 88% of the applications with this characteristic were malicious. Manilyzer is less effective on its own but can be enhanced with other features associated with permissions, such as API calls; it is effectively used to detect adware, spyware and SMS malware.

(Hsieh, Wu, and Kao 2015) study and summarize the threat of malware on handheld devices, how malware writers evade anti-virus detection on mobile devices, and the techniques used to deliver malicious payloads onto mobile systems. The authors conclude by presenting the analysis methodologies used in detecting malware.

(Lindorfer et al. 2015) propose MARVIN, a system built on the large-scale Android malware analysis sandbox ANDRUBIS, to provide users with a risk assessment for an application. They developed an end-user app to which users submit an app and receive a score that tells them how malicious the application is; MARVIN has 98.24% accuracy with less than 0.04% false positives. Static features such as permissions, API calls based on used permissions, the reflection API, cryptographic APIs and dynamic loading of code are combined with dynamic features such as file operations, network operations, phone events, data leaks, dynamically loaded code and dynamically registered broadcast receivers. An SVM with a linear classifier is used as the classification model. The authors used a labeled dataset obtained from the Play Store and the Genome Project, and used their system to classify samples from VirusTotal.

Table 4 summarizes the static and dynamic features combined and used as part of hybrid analysis. As seen from Table 4, permissions are the most used static feature, combined with dynamic features such as logged information, API call traces and network traffic.

{| class="wikitable"
|+ Table 4: Top features used in hybrid analysis
! Sl. No. !! Feature !! Machine learning algorithm
|-
| 1 || CPU usage, no. of packets sent, no. of running processes, battery level || Naive Bayes, decision trees, Random Forest, BayesNet, K-Means, Logistic Regression
|-
| 2 || Static: information from the apk, decoded smali files. Dynamic: ADB logging, tcpdump || Random Forest
|-
| 3 || Static: software profile. Dynamic: strace system calls and process ID || Naive Bayes, SVM
|-
| 4 || Static: permissions, high-priority receivers, version numbers || Naive Bayes, SVM, K-NN, J48 decision trees
|-
| 5 || Static: permissions, API calls based on used permissions, reflection API, cryptographic API, dynamic loading of code. Dynamic: file operations, network operations, phone events, data leaks, dynamically loaded code, dynamically registered broadcast receivers || SVM with linear function
|}
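A minimal sketch of the hybrid pattern, assuming scikit-learn and synthetic data, is shown below: static and dynamic feature vectors for the same app are concatenated before a single linear SVM is trained. The feature dimensions are placeholders, not the feature set of any of the surveyed systems.

<syntaxhighlight lang="python">
# Sketch of hybrid analysis: concatenate static features (e.g., permission
# indicators) with dynamic features (e.g., runtime counters) for each app and
# train one linear classifier on the combined vector. Data are illustrative.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(4)
n_apps = 400
X_static = rng.integers(0, 2, size=(n_apps, 60)).astype(float)   # binary static features
X_dynamic = rng.poisson(3.0, size=(n_apps, 12)).astype(float)    # runtime event counts
y = rng.integers(0, 2, size=n_apps)                               # 0 = benign, 1 = malicious

X_hybrid = np.hstack([X_static, X_dynamic])    # one row per app: static ++ dynamic
clf = LinearSVC(C=0.5, max_iter=10000).fit(X_hybrid, y)
print("training accuracy on combined features:", clf.score(X_hybrid, y))
</syntaxhighlight>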
===Future Goals on Counteracting the Update Attack===
From this analysis it can be seen that only very few studies deal with counteracting the update attack. As discussed in the previous section, the update attack is hard to detect because the previously installed version on the device is benign, and it is not known when the malicious activity will be performed. The key to detecting the update attack is to keep track of the functionality of the previous, benign versions of the applications installed on the Android device. When an application is updated, we can compute the difference between the old and new versions of the application and, by combining machine learning techniques with the knowledge acquired from known malicious files, we can detect the update attack and the malicious intent of the malware author.

Figure 1: Counteracting the update attack.
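A rough sketch of this version-diffing idea, under the assumption that a classifier has already been trained on features of known malicious payloads, is given below; the helper names, the vocabulary and the Random Forest are illustrative and not an implementation from the paper.

<syntaxhighlight lang="python">
# Sketch of the proposed direction: diff the static features of an installed
# app against its update and score only what the update newly introduces.
# Feature sets, vocabulary and the trained model are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

VOCAB = ["android.permission.SEND_SMS", "android.permission.READ_CONTACTS",
         "DexClassLoader.loadClass", "Runtime.exec"]

def new_capabilities(old_features, new_features):
    """Permissions / API calls present in the update but not in the old version."""
    return set(new_features) - set(old_features)

def to_vector(features):
    return np.array([[1 if v in features else 0 for v in VOCAB]])

# Stand-in for a classifier trained offline on known malicious update payloads.
rng = np.random.default_rng(5)
model = RandomForestClassifier(random_state=5).fit(
    rng.integers(0, 2, size=(100, len(VOCAB))), rng.integers(0, 2, size=100))

old_version = {"android.permission.READ_CONTACTS"}
updated_version = {"android.permission.READ_CONTACTS",
                   "android.permission.SEND_SMS", "DexClassLoader.loadClass"}

added = new_capabilities(old_version, updated_version)
print("newly added capabilities:", added)
print("suspicion score for the diff:", model.predict_proba(to_vector(added))[0][1])
</syntaxhighlight>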
===Conclusion===
This study summarizes recent developments in Android malware detection using machine learning algorithms. Detection techniques and systems that use static, dynamic and hybrid approaches are discussed and highlighted, and a method that could potentially counteract the update attack is outlined. The unavailability of a larger Android malware dataset remains a major problem in evaluating the various approaches. With a proper dataset shared among researchers, a system that learns a new malware and shares that knowledge with all mobile devices, so that they can protect themselves from future attacks, could be developed.
===References===
* M.S. Alam and S.T. Vuong. Random forest classification for detecting android malware. In Green Computing and Communications (GreenCom), 2013 IEEE and Internet of Things (iThings/CPSCom), IEEE International Conference on and IEEE Cyber, Physical and Social Computing, pages 663–669, Aug 2013. doi:10.1109/GreenCom-iThings-CPSCom.2013.122.
* P.P.K. Chan and Wen-Kai Song. Static detection of android malware by using permissions and API calls. In Machine Learning and Cybernetics (ICMLC), 2014 International Conference on, volume 1, pages 82–87, July 2014. doi:10.1109/ICMLC.2014.7009096.
* Hsin-Yu Chuang and Sheng-De Wang. Machine learning based hybrid behavior models for android malware analysis. In Software Quality, Reliability and Security (QRS), 2015 IEEE International Conference on, pages 201–206, Aug 2015. doi:10.1109/QRS.2015.37.
* M. Fazeen and R. Dantu. Another free app: Does it have the right intentions? In Privacy, Security and Trust (PST), 2014 Twelfth Annual International Conference on, pages 282–289, July 2014. doi:10.1109/PST.2014.6890950.
* S. Feldman, D. Stadther, and Bing Wang. Manilyzer: Automated android malware detection through manifest analysis. In Mobile Ad Hoc and Sensor Systems (MASS), 2014 IEEE 11th International Conference on, pages 767–772, Oct 2014. doi:10.1109/MASS.2014.65.
* M. Ghorbanzadeh, Yang Chen, Zhongmin Ma, T.C. Clancy, and R. McGwier. A neural network approach to category validation of android applications. In Computing, Networking and Communications (ICNC), 2013 International Conference on, pages 740–744, Jan 2013. doi:10.1109/ICCNC.2013.6504180.
* W. Glodek and R. Harang. Rapid permissions-based detection and analysis of mobile malware using random decision forests. In Military Communications Conference, MILCOM 2013, pages 980–985, Nov 2013. doi:10.1109/MILCOM.2013.170.
* Hyo-Sik Ham and Mi-Jung Choi. Analysis of android malware detection performance using machine learning classifiers. In ICT Convergence (ICTC), 2013 International Conference on, pages 490–495, Oct 2013. doi:10.1109/ICTC.2013.6675404.
* Wan-Chen Hsieh, Chuan-Chi Wu, and Yung-Wei Kao. A study of android malware detection technology evolution. In Security Technology (ICCST), 2015 International Carnahan Conference on, pages 135–140, Sept 2015. doi:10.1109/CCST.2015.7389671.
* I. Ideses and A. Neuberger. Adware detection and privacy control in mobile devices. In Electrical Electronics Engineers in Israel (IEEEI), 2014 IEEE 28th Convention of, pages 1–5, Dec 2014. doi:10.1109/EEEI.2014.7005849.
* F. Idrees and M. Rajarajan. Investigating the android intents and permissions for malware detection. In Wireless and Mobile Computing, Networking and Communications (WiMob), 2014 IEEE 10th International Conference on, pages 354–358, Oct 2014. doi:10.1109/WiMOB.2014.6962194.
* Q. Jerome, K. Allix, R. State, and T. Engel. Using opcode-sequences to detect malicious android applications. In Communications (ICC), 2014 IEEE International Conference on, pages 914–919, June 2014. doi:10.1109/ICC.2014.6883436.
* Hwan-Hee Kim and Mi-Jung Choi. Linux kernel-based feature selection for android malware detection. In Network Operations and Management Symposium (APNOMS), 2014 16th Asia-Pacific, pages 1–4, Sept 2014. doi:10.1109/APNOMS.2014.6996540.
* H. Kurniawan, Y. Rosmansyah, and B. Dabarsyah. Android anomaly detection system using machine learning classification. In Electrical Engineering and Informatics (ICEEI), 2015 International Conference on, pages 288–293, Aug 2015. doi:10.1109/ICEEI.2015.7352512.
* M. Lindorfer, M. Neugschwandtner, and C. Platzer. MARVIN: Efficient and comprehensive mobile app classification through static and dynamic analysis. In Computer Software and Applications Conference (COMPSAC), 2015 IEEE 39th Annual, volume 2, pages 422–433, July 2015. doi:10.1109/COMPSAC.2015.103.
* Wen Liu. Mutiple classifier system based android malware detection. In Machine Learning and Cybernetics (ICMLC), 2013 International Conference on, volume 1, pages 57–62, July 2013. doi:10.1109/ICMLC.2013.6890444.
* Xing Liu and Jiqiang Liu. A two-layered permission-based android malware detection scheme. In Mobile Cloud Computing, Services, and Engineering (MobileCloud), 2014 2nd IEEE International Conference on, pages 142–148, April 2014. doi:10.1109/MobileCloud.2014.22.
* Yu Lu, Pan Zulie, Liu Jingju, and Shen Yi. Android malware detection technology based on improved bayesian classification. In Instrumentation, Measurement, Computer, Communication and Control (IMCCC), 2013 Third International Conference on, pages 1338–1341, Sept 2013. doi:10.1109/IMCCC.2013.297.
* M.Z. Mas'ud, S. Sahib, M.F. Abdollah, S.R. Selamat, and R. Yusof. Analysis of features selection and machine learning classifier in android malware detection. In Information Science and Applications (ICISA), 2014 International Conference on, pages 1–5, May 2014. doi:10.1109/ICISA.2014.6847364.
* A. Munoz, I. Martin, A. Guzman, and J.A. Hernandez. Android malware detection from google play meta-data: Selection of important features. In Communications and Network Security (CNS), 2015 IEEE Conference on, pages 701–702, Sept 2015. doi:10.1109/CNS.2015.7346893.
* D.V. Ng and J.-I.G. Hwang. Android malware detection using the dendritic cell algorithm. In Machine Learning and Cybernetics (ICMLC), 2014 International Conference on, volume 1, pages 257–262, July 2014. doi:10.1109/ICMLC.2014.7009126.
* U. Pehlivan, N. Baltaci, C. Acarturk, and N. Baykal. The analysis of feature selection methods and classification algorithms in permission based android malware detection. In Computational Intelligence in Cyber Security (CICS), 2014 IEEE Symposium on, pages 1–8, Dec 2014. doi:10.1109/CICYBS.2014.7013371.
* N. Peiravian and Xingquan Zhu. Machine learning for android malware detection using permission and API calls. In Tools with Artificial Intelligence (ICTAI), 2013 IEEE 25th International Conference on, pages 300–305, Nov 2013. doi:10.1109/ICTAI.2013.53.
* R. Raveendranath, V. Rajamani, A.J. Babu, and S.K. Datta. Android malware attacks and countermeasures: Current and future directions. In Control, Instrumentation, Communication and Computational Technologies (ICCICCT), 2014 International Conference on, pages 137–143, July 2014. doi:10.1109/ICCICCT.2014.6992944.
* RiskIQ. Android malware attacks and countermeasures: Current and future directions. June 2014.
* J. Sahs and L. Khan. A machine learning approach to android malware detection. In Intelligence and Security Informatics Conference (EISIC), 2012 European, pages 141–147, Aug 2012. doi:10.1109/EISIC.2012.34.
* A.A.A. Samra, Kangbin Yim, and O.A. Ghanem. Analysis of clustering technique in android malware detection. In Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS), 2013 Seventh International Conference on, pages 729–733, July 2013. doi:10.1109/IMIS.2013.111.
* B. Sanz, I. Santos, C. Laorden, X. Ugarte-Pedrero, and P.G. Bringas. On the automatic categorisation of android applications. In Consumer Communications and Networking Conference (CCNC), 2012 IEEE, pages 149–153, Jan 2012. doi:10.1109/CCNC.2012.6181075.
* A. Shabtai. Malware detection on mobile devices. In Mobile Data Management (MDM), 2010 Eleventh International Conference on, pages 289–290, May 2010. doi:10.1109/MDM.2010.28.
* A. Shabtai, Y. Fledel, and Y. Elovici. Automated static code analysis for classifying android applications using machine learning. In Computational Intelligence and Security (CIS), 2010 International Conference on, pages 329–333, Dec 2010. doi:10.1109/CIS.2010.77.
* L. Tenenboim-Chekina, O. Barad, A. Shabtai, D. Mimran, L. Rokach, B. Shapira, and Y. Elovici. Detecting application update attack on mobile devices through network features. In Computer Communications Workshops (INFOCOM WKSHPS), 2013 IEEE Conference on, pages 91–92, April 2013. doi:10.1109/INFCOMW.2013.6970755.
* Te-En Wei, Ching-Hao Mao, A.B. Jeng, Hahn-Ming Lee, Horng-Tzer Wang, and Dong-Jie Wu. Android malware detection via a latent network behavior analysis. In Trust, Security and Privacy in Computing and Communications (TrustCom), 2012 IEEE 11th International Conference on, pages 1251–1258, June 2012. doi:10.1109/TrustCom.2012.91.
* Yu Wei, Hanlin Zhang, Linqiang Ge, and R. Hardy. On behavior-based detection of malware on android platform. In Global Communications Conference (GLOBECOM), 2013 IEEE, pages 814–819, Dec 2013. doi:10.1109/GLOCOM.2013.6831173.
* Westyarian, Y. Rosmansyah, and B. Dabarsyah. Malware detection on android smartphones using API class and machine learning. In Electrical Engineering and Informatics (ICEEI), 2015 International Conference on, pages 294–297, Aug 2015. doi:10.1109/ICEEI.2015.7352513.
* Zhao Xiaoyan, Fang Juan, and Wang Xiujuan. Android malware detection based on permissions. In Information and Communications Technologies (ICT 2014), 2014 International Conference on, pages 1–5, May 2014. doi:10.1049/cp.2014.0605.
* J. Xu, Y. Yu, Z. Chen, B. Cao, W. Dong, Y. Guo, and J. Cao. MobSafe: Cloud computing based forensic analysis for massive mobile applications using data mining. Tsinghua Science and Technology, 18(4):418–427, August 2013. doi:10.1109/TST.2013.6574680.
* S.Y. Yerima, S. Sezer, G. McWilliams, and I. Muttik. A new android malware detection approach using bayesian classification. In Advanced Information Networking and Applications (AINA), 2013 IEEE 27th International Conference on, pages 121–128, March 2013. doi:10.1109/AINA.2013.88.
* S.Y. Yerima, S. Sezer, and G. McWilliams. Analysis of bayesian classification-based approaches for android malware detection. IET Information Security, 8(1):25–36, Jan 2014. ISSN 1751-8709. doi:10.1049/iet-ifs.2013.0095.
* S.Y. Yerima, S. Sezer, and I. Muttik. Android malware detection using parallel machine learning classifiers. In Next Generation Mobile Apps, Services and Technologies (NGMAST), 2014 Eighth International Conference on, pages 37–42, Sept 2014. doi:10.1109/NGMAST.2014.23.