=Paper= {{Paper |id=Vol-1584/paper19 |storemode=property |title=A Study of Android Malware Detection Techniques and Machine Learning |pdfUrl=https://ceur-ws.org/Vol-1584/paper19.pdf |volume=Vol-1584 |authors=Balaji Baskaran,Anca Ralescu |dblpUrl=https://dblp.org/rec/conf/maics/BaskaranR16 }} ==A Study of Android Malware Detection Techniques and Machine Learning== https://ceur-ws.org/Vol-1584/paper19.pdf
 Balaji Baskaran and Anca Ralescu                                  MAICS 2016                                                pp. 15–23




      A Study of Android Malware Detection Techniques and Machine Learning

                                               Balaji Baskaran and Anca Ralescu
                                                           EECS Department
                                                        University of Cincinnati
                                                      Cincinnati, OH 45221 - 0030
                                               baskarbi@mail.uc.edu, anca.ralescu@uc.edu



Abstract

Android OS is one of the most widely used mobile operating systems. The number of malicious applications and adware is increasing constantly, on par with the number of mobile devices. A great number of commercial signature-based tools are available on the market, which prevent to an extent the penetration and distribution of malicious applications. Numerous studies claim that traditional signature-based detection systems work well only up to a certain level and that malware authors use numerous techniques to evade these tools. Given this state of affairs, there is an increasing need for an alternative, robust malware detection system to complement and rectify the signature-based system. Substantial recent research has focused on machine learning algorithms that analyze features from malicious applications and use those features to classify and detect unknown malicious applications. This study summarizes the evolution of malware detection techniques based on machine learning algorithms, focused on the Android OS.

Introduction

According to a 2014 research study (RiskIQ, 2014), malicious applications in the Google Play Store increased by 388% between 2011 and 2013.

As the initial part of our research, we conducted an extensive study in which we analyze the current trends and approaches for detecting malware on Android systems using machine learning techniques. The overall goal of this study is to identify the research conducted so far on Android malware detection using machine learning techniques. With this analysis we can formulate a defense mechanism specifically to counteract the update attack, the most difficult intrusion technique to detect and eliminate.

Update Attack: On Android, an update attack is one in which a benign application installed on the system downloads malicious payloads while updating itself, or downloads third-party malicious applications and installs them on the system. This type of attack is very hard to detect because the original application is benign. Unless we track both the previously installed versions and the application after the update, we cannot detect the malicious activity. We aim to give a brief approach to counteracting the update attack, together with the survey of recent trends in malware detection.

Based on current attack trends and an analysis of the present literature, Raveendranath et al. (2014) list the types of malware as follows:

1. Information Extraction
   Compromises the device and steals personal information such as the IMEI number, the user's personal information, etc.

2. Automatic Calls and SMS
   The user's phone bill is increased by making calls and sending SMS messages to premium numbers.

3. Root Exploits
   The malware gains system root privileges, takes control of the system and modifies information.

4. Search Engine Optimizations
   Artificially searches for a term and simulates clicks on targeted websites in order to increase the revenue of a search engine or the traffic on a website.

5. Dynamically Downloaded Code
   An installed benign application downloads malicious code and deploys it on the mobile device.

6. Covert Channel
   A vulnerability in the device that facilitates information leaks between processes that are not supposed to share information.

7. Botnets
   A network of compromised mobile devices with a BotMaster, controlled through Command and Control (C&C) servers. Botnets carry out spam delivery and DDoS attacks on the host devices.

From this point on, the structure of the paper is as follows. The next section gives a general overview of the security currently deployed by the Play Store. The section after it classifies the various methods used for detecting malware on Android systems. The paper then concludes.

Overview of Android System Security

The Google Play Store uses an in-house malicious application detection system called Bouncer. However, researchers have shown that Bouncer's ability to detect malicious applications is minimal, and they could successfully publish a prototype malicious application in the Play Store. The Android Play Store
uses an application's metadata, such as user ratings and user comments, to flag a malicious application. But by the time the malicious application is detected, it could have done enough damage to the affected mobile system.

Malware authors use many techniques to evade detection, such as (a) code obfuscation, (b) encryption, (c) including permissions which are not needed by the application, (d) requesting unneeded hardware, and (e) the download or update attack, in which a benign application updates itself or another application with a malicious payload, which is very tough to detect. This encourages new research on detection techniques, including machine learning based techniques. Many studies have shown that machine learning algorithms can detect malicious activities with very high accuracy.

Android Malware Detection

Based on the features used to classify an application, we can categorize the analysis as static or dynamic. Static analysis is done without running the application. Examples of static features include (a) permissions, which can be extracted from the AndroidManifest.xml file, and (b) API calls, which can be extracted from the application's .dex file. Dynamic analysis deals with features that are extracted from the application while it is running, including (a) network traffic, (b) battery usage, (c) IP address, etc. The third type of analysis is hybrid analysis, which combines features from static and dynamic techniques. The rest of this section describes the features extracted from the applications and the machine learning algorithms used.

Static Analysis

In static analysis, the features are extracted from the application file without executing the application. This methodology is resource and time efficient, as the application is not executed. At the same time, this analysis suffers from the code obfuscation techniques that malware authors employ to evade static detection. One very popular evasion technique is the update attack: a benign application is installed on the mobile device and, when the application gets an update, the malicious content is downloaded and installed as part of the update. This cannot be detected by static analysis techniques, which scan only the benign application.

The most commonly used static features are permissions and API calls. Since these can be extracted directly from the application package (the AndroidManifest.xml and .dex files) and influence the malware detection rate to a high extent, extensive research has been done with them as features, alone as well as combined with other features extracted from the metadata available in the Google Play Store, such as version name, version number, author's name and last updated time.

Sahs and Khan (2012) used permissions and Control Flow Graphs (CFG) as features with a One-class Support Vector Machine (SVM). Most of the training data are benign applications, and the classifier classifies a sample as malicious only if it is sufficiently different from the benign class.

Shabtai et al. (2010) used permissions, framework methods and framework classes for their classification system.

Sanz et al. (2012) extracted the strings in the application, permissions, user rating, number of ratings and size of the application, and used Bayesian Networks, J48 Decision Tree, Random Forest and SVM trained with the SMO algorithm. A total of 820 samples were used for testing, and the authors concluded that they could achieve very high accuracy with a low false positive rate.

Ghorbanzadeh et al. (2013) used neural networks to detect an application's category from its permissions by means of multi-layered feed-forward networks. A feed-forward neural network is built with two layers, each containing 10 neurons. The hidden layer uses a sigmoid transfer function and the output layer a linear transfer function. The authors assumed that the permissions declared in the manifest file may be manipulated by malware authors, who may misrepresent the categories declared in the manifest. To simulate this, the authors permuted the permissions of 50% of the test data and fed them into the network.

Yerima et al. (2013) used 2000 applications, 1000 malicious and 1000 benign. They extracted features such as permissions, API calls, native Linux system commands and various features from the manifest and class files. Malware authors embed native Linux commands such as chown, mount and remount, and run them in the Android system when the application is launched. Mutual information (entropy) is used to rank the features, and a Bayesian classifier is then used for classification.

Samra et al. (2013) extracted features from AndroidManifest.xml, such as the count of XML elements, and application-specific information such as name, category, description, package info, rating values, rating counts and price. Information on 18174 Android applications, with 4612 in the business category and 13535 in tools, was extracted using web crawlers. The applications were clustered using K-Means clustering.

Peiravian and Zhu (2013) utilized permissions, API calls and the combination of both as features. The two types of permissions in Android, requested and required, are used to express an application as a binary vector where P_i = 1 iff the manifest has the i-th permission. In the same way, API calls are expressed as a binary vector with API_i = 1 iff the i-th API call is made in the application. These two feature vectors are concatenated to form the third feature set. A total of 2510 samples, of which 1260 are malicious and 1250 benign, are used. The authors concluded that Bagging, an ensemble classification method, has the best performance in classifying all created datasets.

Liu (2013) investigated three specific types of malware: SMS-related, control-related and spy-related. An application's permissions and its <uses-feature> XML tag, which requests the hardware devices needed to run the application, are extracted and used as features. Information Gain is used to select important features, and an SVM as the basic classifier is used to detect malicious applications. The author could detect spy-related malicious applications with an accuracy of 81%, SMS-related malicious applications with an accuracy of 97% and control-related malicious applications with 100% accuracy, and could detect benign applications with an accuracy of 88%.

Glodek and Harang (2013) constructed five Random Forests with 5-fold cross validation and compared their performance in detecting malicious applications. They used 500 malicious and 500 benign applications from North Carolina State University's malware project. Permissions, broadcast receivers and native code embedded in the application are used as features, and they concluded that their method outperforms many commercial antivirus detection tools.

Jerome et al. (2014) extracted the opcodes from the classes.dex file and translated them into opcode sequences, binary sequences of k-grams that characterize the least functionalities required by a program. They trained their model with the Genome Project dataset and 1246 applications randomly picked from the Google Play Store. The test dataset consists of 25,476 malware samples and 15,670 benign applications from VirusTotal. Information Gain was used to select important features among the available ones. The authors used a linear implementation of SVM to classify application samples. The results were compared with the detection rates of 25 antivirus tools. The study reports interesting signature patterns of malware, goodware, false positives and false negatives of their classifier. The false negatives were found to be adware, and they were also considered a threat by the tool.

Pehlivan et al. (2014) used 3748 application packages and developed C# scripts to automatically extract about 182 attributes, including permissions, version number and version name of the applications. The study compared feature selection methods such as Gain Ratio Attribute Evaluator, Relief Attribute Evaluator, Control Flow Subset Evaluator and Consistency Subset Evaluator, and the machine learning algorithms Bayesian classification, Classification and Regression Tree (CART), J48 Decision Tree (DT), Random Forest (RF) and SMO. Using the feature selection methods, they came up with 97 features that could represent the whole dataset. Finally, the authors conclude that, with just 25 features, the Control Flow Subset Evaluator gave good performance, and that Random Forest and J48 performed better than the Bayesian classifier.

Chan and Song (2014) analyzed 796 benign and 175 malicious applications. Permissions from the manifest.xml file and API call information from the classes.dex file are extracted, and with Information Gain they selected a set of 19 relevant API calls. They compared the results obtained by machine learning algorithms such as Naive Bayes, SVM with the SMO algorithm, RBF Network, Multi Layer Perceptron, Liblinear, J48 decision tree and Random Forests. The authors concluded that they were able to reach 90% accuracy by using API calls and permissions combined, better than using the individual features alone.

Liu and Liu (2014) combined the two types of permissions, required and requested, designed a two-layer approach with these features and employed machine learning algorithms to detect malicious applications. A total of 28,548 benign applications and 1,536 malicious applications, together with permission pairs, i.e., combinations of any two requested permissions, are analyzed. The two-layered approach helped to balance the detection accuracy and detection speed of the classifier. In Phase 1, requested permissions and the J48 Decision Tree algorithm are used for detection, and in Phase 2, requested permission pairs and the J48 Decision Tree are used. If the results obtained from the two phases contradict each other, used permission pairs and J48 are used to classify again. The authors achieved a good result with this approach and recommended using permissions at the component level rather than the application level for better detection of malicious activities.

Ideses and Neuberger (2014) used permissions, broadcast receivers and activities, byte code fragments and system calls as features and trained an SVM with the training dataset. The researchers tested their proposed malware detection system with a security tester for benchmarking, where their system was tested with 7,000 samples. They conclude that their system could achieve about a 99.3% positive rate with just a 0.14% false alarm rate.

Yerima et al. (2014a) presented and analyzed three Bayesian classification approaches for detecting Android malware. Permissions and code-based properties such as API calls, both Java system based and Android system based, as well as Linux and Android system commands, are extracted from the sample applications. A list of the top 20 permissions and top 25 API calls used by benign and malicious applications is presented.

Fazeen and Dantu (2014) combined the intentions of the applications, especially Task Intentions, with permissions as features in developing their model. First, the requested permissions are extracted and a histogram is constructed for that task-intention category. Normalizing this results in an I-shaped PMF. This shape is used to compare and classify unknown applications as benign or malicious based on their Task Intentions. The system works as follows:
• Phase I trains and uses machine learning algorithms to find the task intentions of the sample applications.
• Phase II uses the knowledge from Phase I to find the task intention of an unknown application and classify it as benign or malicious. The I shape is compared with the requested permissions using a matching ratio that is generated by a machine learning algorithm. If the ratio is within a threshold, the application is potentially safe.
The authors used Naive Bayes, Multi Layered Perceptron and Random Forests and compared their performance.

Xiaoyan et al. (2014) extracted permissions from the manifest and represented them as a binary vector. Principal Component Analysis (PCA) is then performed to select the best features. A linear SVM is trained to classify the app samples. The authors compare the result with other classifiers such as J48 Decision Tree, Naive Bayes, BayesNet, CART and RandomForest, and conclude that SVM gives better performance.

Yerima et al. (2014b) came up with a parallel implementation of their system to detect malicious Android applications. They used application-related features such as permissions and standard OS and Android framework commands. They developed a parallel implementation of a logistic function based classifier, Naive Bayes (a probabilistic method), and PART and RIDOR, which are rule-based classifiers. With the features extracted, classification is performed with the individual algorithms and then the parallel implementation is carried out. The maximum probability scheme reached an accuracy of 97.5%.

Idrees and Rajarajan (2014) combined permissions and intents and used 292 applications for training and 340 for testing their model. The study describes some usage statistics of benign and malicious applications with regard to intents and permissions, and developed Naive Bayes, Kstar and Prism classifiers to distinguish malicious applications from benign ones.

Munoz et al. (2015) collected information from Google Play metadata, such as intrinsic application features, application category, developer-related features, certificate-related features and social-related features. They concluded that certificate and developer information and intrinsic application features are the most promising features for identifying malware with just metadata.

Westyarian et al. (2015) used 205 benign and 207 malicious application files and extracted the API calls that are related only to the permissions declared in the <uses-permission> label of the manifest.xml file. The study concluded that 97% of the malware requests telephonyManager and connectivityManager, and that these are the most important features. Random forest classification obtains 92.4% when evaluated with cross-validation, and SVM obtains 91.4% when evaluated with a percentage split.

Chuang and Wang (2015) collected API calls from benign applications and from malicious applications separately and used these as features for classifying an unknown sample. The APIs in the unknown sample are ranked according to the difference between their number of occurrences in benign applications and their number of occurrences in malicious applications. In the single-model approach, the two feature sets are combined into a single vector. In the Malicious-model approach, only the hypothesis from malicious-tended APIs is used for classification. The Hybrid approach combines two separately trained SVM models, whose results are then compared to predict whether the unknown sample is malicious. The Hybrid model behaved much better than the Malicious model, but the single model obtained from the combined features outperformed the Malicious model.

Table 1 lists the most frequently used features in static analysis. Table 2 summarizes the top features that are combined with other features to produce a better detection rate. From Table 1 and Table 2 it can be clearly seen that permissions and API calls, the two features extracted from the manifest file and the .dex file, produce higher detection rates, and that to make them more fail-safe they can be combined with other features, such as metadata collected from the Google Play Store or features extracted from XML elements.

Table 1: Topmost used features in static analysis
   Sl. No.   Feature
   1         Permission
   2         API calls
   3         Strings extracted
   4         Native commands
   5         XML elements
   6         Meta data
   7         Opcodes from .dex file
   8         Task Intents

Table 2: Top features combined with other features in static analysis
   Feature                                                  Combined With
   Permissions                                              Broadcast receivers
                                                            Uses-feature tag
                                                            Android OS commands
                                                            API calls
                                                            Meta-data
                                                            Opcodes
   Features extracted from manifest files and class files   API calls

Dynamic Analysis

Wei et al. (2012) used Droidbox, a tool that monitors an application in real time, to dynamically analyze the behavior of Android applications. The IP address of the source is extracted from the network traffic after the application is run in a sandbox environment. The research concentrated only on the network characteristics of the malware, leveraging the fact that it will find its next target soon. The extracted IP address is used to find the spatial address using external services and to determine the uniformity of the geographic distribution of the hosts, because infected hosts will be distributed worldwide. After extracting the features, an M x N APP-GEO matrix is constructed, with M representing the Android applications (rows) and N the network features. ICA (Independent Component Analysis) is used to extract the latent concept or sparse from the noisy spamming data. The researchers used Weka and FastICA, two open source libraries, to evaluate their model. A total of 310 malware samples were used and they could achieve about 93% accuracy.

Ham and Choi (2013) used 30 normal apps and 5 malware samples (GoldDream, PJApps, DroidKungFu2, Snake and Angry Birds Rio Unlocker). The resources allocated when the app starts are monitored and the behavioral pattern is extracted. These resource data are stored within the device and converted into feature vectors. The features are subdivided into 7 categories: Network, SMS, CPU, power usage, Process (such as ID, name and running process), Memory (Native, Dalvik and other) and Virtual Memory. The authors identified 32 features related to malware detection and applied
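To make the permission and API-call features that recur throughout the static analysis studies concrete, the following is a minimal Python sketch, not any surveyed author's implementation. It encodes each application as the concatenated binary vector described for Peiravian and Zhu (2013) and ranks the resulting features by Information Gain, the selection criterion that several of the studies report using. The vocabularies `PERMISSIONS` and `API_CALLS`, the four toy applications and all function names are assumptions made for illustration.

```python
import math
from typing import Iterable, List, Sequence, Tuple

# Hypothetical vocabularies; a real study would derive them from its dataset.
PERMISSIONS = ["INTERNET", "SEND_SMS", "READ_CONTACTS", "READ_PHONE_STATE"]
API_CALLS = ["getDeviceId", "sendTextMessage", "openConnection"]
FEATURE_NAMES = PERMISSIONS + API_CALLS


def binary_vector(present: Iterable[str], vocabulary: Sequence[str]) -> List[int]:
    """Return v with v[i] = 1 iff vocabulary[i] occurs in `present`."""
    seen = set(present)
    return [1 if item in seen else 0 for item in vocabulary]


def combined_features(permissions: Iterable[str], api_calls: Iterable[str]) -> List[int]:
    """Concatenate the permission vector and the API-call vector."""
    return binary_vector(permissions, PERMISSIONS) + binary_vector(api_calls, API_CALLS)


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    result = 0.0
    for value in set(labels):
        p = list(labels).count(value) / total
        result -= p * math.log2(p)
    return result


def information_gain(column: Sequence[int], labels: Sequence[int]) -> float:
    """H(labels) minus the entropy that remains after splitting on one feature."""
    gain = entropy(labels)
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain


def rank_features(vectors: Sequence[Sequence[int]],
                  labels: Sequence[int]) -> List[Tuple[str, float]]:
    """Score every feature column and sort from most to least informative."""
    scores = [(name, information_gain([v[i] for v in vectors], labels))
              for i, name in enumerate(FEATURE_NAMES)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy labeled applications (invented): (permissions, API calls, label), 1 = malicious.
TOY_APPS = [
    (["INTERNET", "SEND_SMS"], ["sendTextMessage"], 1),
    (["INTERNET", "SEND_SMS", "READ_PHONE_STATE"], ["sendTextMessage", "getDeviceId"], 1),
    (["INTERNET"], ["openConnection"], 0),
    (["INTERNET", "READ_CONTACTS"], ["openConnection"], 0),
]
VECTORS = [combined_features(perms, calls) for perms, calls, _ in TOY_APPS]
LABELS = [label for _, _, label in TOY_APPS]

if __name__ == "__main__":
    for name, score in rank_features(VECTORS, LABELS):
        print(f"{name:18s} {score:.3f}")
```

On this toy data the INTERNET permission, which every sample requests, receives a score of zero. This illustrates why such a criterion is used to discard uninformative features before a classifier is trained.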




Information Gain to select features. They used Naive Bayes,            each invoked call is counted. PCA is used to select the
Random Forest, LR and SVM with 10-fold cross validation.               important features and then the classifier classifies the
   The authors concluded that the confusion matrices of Naive          application sample as malicious or benign based on the anomaly
Bayes and LR are irregular in distribution with these features.        score obtained for the input. The author compared their
SVM correctly classified almost 100% of normal data but falsely        system's performance with classifiers such as Naive Bayes,
detected malicious applications as benign. Random Forests              J48 Decision Tree and SVM and claims that they could achieve
outperformed all the other algorithms and correctly classified         a 98.4% detection rate.
the majority of normal and malware applications.                          Kim and Choi (2014) extracted Linux-based features from the
   Lu et al. (2013) compared the Bayesian method alone with the        Android OS and used them as features to detect malicious
Bayesian method combined with Chi-Square feature selection             applications. 59 features were obtained, such as Memory, CPU
to evaluate the performance of the two approaches. The study           and Network. Six malware samples were run on the system, and
concluded that the Bayesian method with Chi-Square yielded             the system was monitored to collect these features. Every 10
an accuracy of 89%, while the Bayesian method alone yielded            seconds the data is collected and sent to a server, and the
80%.                                                                   server does the classification. Out of the 59 features, 36
   Tenenboim-Chekina et al. (2013) used 5 to 10 self-written           are selected and the results are compared before and after
Trojan malware samples, each in two versions: a benign one and         applying feature selection. The study reports that feature
a malicious one, which is a repacked version of the benign             selection improves the accuracy and reduces the False
application with malicious code added. While the application           Positive Rate of the classification.
is running, many network-based features are extracted. The                Kurniawan et al. (2015) used Logger, a default application
self-written applications are installed on the devices and             built into Android, to extract the sum of Internet traffic,
their behavior is collected and analyzed. This makes the               the percentage of battery used and the battery temperature
traffic patterns of benign and malicious applications                  every minute. This information is collected as a set of
distinguishable. Feature measurements are performed at fixed           features and fed into Weka, an open source learning library,
time intervals and then aggregation functions are computed             for training and testing with Naive Bayes, J48 decision tree
over these measurements. Cross-feature analysis is used to             and Random Forest algorithms. The authors concluded that
explore the correlation between features. The deviations of            Random Forest has a high accuracy of 85.6% with these
abnormal activities from normal activities are observed. With          features and propose other features that can be combined
labeled samples, a threshold of deviation is obtained during           with the existing system to improve the accuracy.
the algorithm formulation. The study could successfully detect            Table 3 summarizes the most frequently used features in
the repacked malicious applications using the network features         dynamic analysis. As seen, network traffic, which includes
learned.                                                               data packets sent, and other behavioral patterns can lead to
                                                                       quick detection of malicious activity. Tracing the IP address
   (Alam and Vuong(2013)) rooted the mobile device to get              can help us to get the geographical landscape of the attack
the details such as, data being sent by applications, IP ad-           surface. Other than this, SMS, information logged by Log-
dress being communicated, number of active communica-                  ger and Strace is very much helpful in achieving a higher
tions, the system calls and used Random forest with 1330               detection rate.
malicious and 407 benign applications. The authors con-
cluded that with more trees and less feature per tree in the           Hybrid Analysis
Random Forest, they could achieve an accuracy of 99%.
   (Mas’ud et al.(2014)Mas’ud, Sahib, Abdollah, Selamat,               The hybrid methodology involves combining static and dy-
and Yusof) monitored the system call of 30 normal appli-               namic features collected from analyzing the application and
cations and 30 malicious applications. The study compares              extracting information while the application is running, re-
5 feature selection methods and 5 Machine Learning classi-             spectively. Though it could increase the accuracy of the de-
fiers KNN, Decision Tree, Multi Layer Perceptron (MLP),                tection rate, it makes the system cumbersome and the analy-
Random Forests, Naive Bayes. The applications are run in               sis process time consuming.
real devices and are monitored for system calls generated by              (Shabtai(2010)) extracted opcodes from the executable
Strace, an application used to log various system activities in        and proposed a framework that monitors the device state at
android systems. Then the features are selected by Informa-            every instant such as CPU usage, number of packets sent
tion Gain and Chi-Square. A set of 5 feature sets are devised          over network, number of running process, battery level. Ap-
and used to compare the efficiency of 5 Machine Learning               plications are downloaded from play store. The authors ex-
algorithms. The study concluded that the MLP achieves a                amine the applicability of Knowledge Based Temporal Ab-
highest accuracy and True Positive rate for one feature set            straction (KBTA) which helps continuously monitor and
while J48 Decision Tree achieves high performance rate for             measure events on a mobile system. The study was con-
another feature set.                                                   cluded with 94% detection rate with the feasibility of run-
   (Ng and Hwang(2014)) also used Strace to monitor the                ning such a system with just 3% power consumption. The
application for 60 secs. The features taken into account were          authors also recommend the implementation of SELinux to
Strace logged ProcessID, system calls, returned values and             enhance the security mechanisms of Android. Efficiency of
times between consecutive system calls. The no of times                Machine Learning algorithms such as Decision Trees, Naive






                                         Table 3: Top features used in Dynamic analysis
      Sl. No.   Feature                                                       Machine Learning Algorithm
      1         Network, SMS, Power Usage, CPU, Process info, Native          Naive Bayes, Random Forest, SVM with
                and Dalvik Memory                                             SMO algorithm
      2         Data packets being sent, IP address, No. of active            Random Forest
                communications, System calls
      3         Process id, System calls collected by Strace, Returned        Naive Bayes, Decision Trees, SVM
                values, Times between consecutive calls
      4         Network Traffic - Destination IP address                      Classification
      5         System calls collected by Strace, Logs of System              J48 Decision Trees, KNN, ST, Multi Layer
                activities                                                    Perceptron
      6         Data collected by Logger, Internet traffic, Battery           Naive Bayes, J48 Decision Trees
                percentage, Temperature collected every minute
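
The periodic measurements listed in Table 3 (for example network traffic sampled at fixed
intervals) are typically aggregated per time window and compared against a deviation threshold
learned from labeled samples. The sketch below illustrates that idea only; it is not the
method of any surveyed paper. The sample values, the window size, the function names and the
"mean + k * standard deviation" threshold rule are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def windows(samples, size):
    """Split a series into consecutive, non-overlapping windows of `size` samples."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def aggregate(window):
    """Aggregation functions computed over one window of raw measurements."""
    return {"mean": mean(window), "std": pstdev(window), "max": max(window)}


def fit_threshold(benign_samples, size, k=3.0):
    """Assumed rule: threshold = mean + k * std of the benign window means."""
    means = [aggregate(w)["mean"] for w in windows(benign_samples, size)]
    return mean(means) + k * pstdev(means)


def flag_windows(samples, size, threshold):
    """Mark each window whose mean exceeds the learned threshold."""
    return [aggregate(w)["mean"] > threshold for w in windows(samples, size)]


# Hypothetical "bytes sent per interval" series (made-up numbers).
benign = [100, 110, 90, 105, 95, 100, 108, 92, 101, 99, 103, 97]
suspect = [100, 102, 98, 900, 950, 1000]

threshold = fit_threshold(benign, size=3)
flags = flag_windows(suspect, size=3, threshold=threshold)
print(flags)
```

With these made-up numbers, only the second window of the suspect series is flagged. A real
system would use several features per window and a trained classifier rather than a single
threshold.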


Bayes, BayesNet, K-Means, Histogram and Logistic Regression are compared and evaluated.
   (Xu et al. 2013) propose a system, MobSafe, that combines dynamic (Android Security
Evaluation Framework - ASEF) and static (Static Android Analysis Framework - SAAF) analysis
methods. They used 100,000 active Android applications from AppChina. For static analysis, the
information from apk files and decoded smali files was analyzed to extract the permissions,
heuristic patterns and program slicing for functions of interest (no ML; analysis takes within
2 minutes). For dynamic analysis, ADB logging and TCP dump were used. The application is
launched on a Virtual Machine and subjected to a simulation of human-level interaction. The
result is then compared with a CVE library, and the application's Internet activity is checked
with the Google Safe Browsing API to determine whether the URLs the app requested are malicious
or not.
   (Wei et al. 2013) analyzed 96 benign applications and 92 malware samples to extract static
features such as software profiles. For dynamic features, Strace is used to record system calls
along with the process ID while the application is running. This information is collected and
used with Support Vector Machines and Naive Bayes.
   (Feldman et al. 2014) propose a system, Manilyzer, which uses requested permissions,
high-priority receivers, low version numbers and abused services as features, and test their
model with 617 applications (307 malicious and 310 benign). The efficiency of Naive Bayes, SVM,
K-Nearest Neighbours and J48 Decision Trees is compared. The authors concluded that most of the
malware was labelled with a 1.x application version number, and that high-priority intent
filters were closely associated with SMS malware, as 88% of the applications with this
characteristic were malicious. Manilyzer is less effective but can be enhanced with other
features associated with permissions, such as API calls. Manilyzer is effectively used to
detect adware, spyware and SMS malware.
   (Hsieh et al. 2015) study and summarize the threat from malware on handheld devices, how
malware writers evade anti-virus detection on mobile devices and the techniques that were used
to deliver the malicious payload onto the mobile systems. The authors conclude the research by
presenting the analysis methodologies for detecting malware.
   (Lindorfer et al. 2015) propose a system, MARVIN, together with the large-scale Android
malware analysis sandbox ANDRUBIS, to provide users with a risk assessment for an application.
They developed an end-user app to which users submit their app and receive a score that tells
them how malicious the application is. MARVIN has 98.24% accuracy with less than 0.04% false
positives. Static features such as permissions, API calls based on used permissions, reflection
API, cryptographic API and dynamic loading of code are combined with dynamic features such as
file operations, network operations, phone events, data leaks, dynamically loaded code and
dynamically registered broadcast receivers. An SVM with a linear classifier is used as the
classification model. The authors made use of a labeled data set obtained from the Play Store
and the Genome project, and used their system to classify samples from VirusTotal.
   Table 4 summarizes the static and dynamic features combined and used as part of hybrid
analysis. As seen from Table 1, Permissions is the most used feature, along with dynamic
features like logged information, API call traces and network traffic.

                                          Table 4: Top features used in Hybrid analysis
      Sl. No.   Feature                                                       Machine Learning Algorithm
      1         CPU Usage, No. of packets sent, No. of running process,       Naive Bayes, Decision Trees, Random Forest,
                Battery level                                                 BayesNet, K-Means, Logistic Regression
      2         Static: Information from apk, Decoded smali files.            Random Forest
                Dynamic: ADB Logging, TCP Dump
      3         Static: Software profile.                                     Naive Bayes, SVM
                Dynamic: Strace - system calls and process id
      4         Static: Permission, High priority receivers, version          Naive Bayes, SVM, K-NN, J48 Decision Trees
                numbers
      5         Static: Permission, API Calls based on used-permission,       SVM with linear function
                reflection API, cryptographic API, dynamic loading of
                code. Dynamic: File operations, Network operations,
                Phone events, Data leaks, Dynamically loaded code,
                dynamically registered broadcast receivers

Future Goals on Counteracting the Update Attack

With this analysis, it can be seen that very little research has been conducted on
counteracting the update attack. As discussed in the previous section, the update attack is
hard to detect because the previous version installed on the device is benign and it is not
known when the malicious activity is performed. The key to detecting the update attack is to
keep track of the functions of the previous benign applications installed on the Android
devices. When an application is updated, we can find the difference between the old and new
versions of the application and, by combining machine learning techniques with the knowledge
acquired from malicious files, detect the update attack and the malicious intent of the malware
author.

          Figure 1: Counteracting the update attack

Conclusion

This study summarizes recent developments in Android malware detection using machine learning
algorithms. Detection techniques and systems that use static, dynamic and hybrid approaches are
discussed and highlighted. A method that could potentially counteract the update attack is
discussed. The unavailability of a larger Android malware dataset remains a great problem in
evaluating various approaches. With a proper dataset shared among researchers, a system could
be developed that learns new malware and shares that knowledge with all mobile devices, so that
they can protect themselves from future attacks.

References

M.S. Alam and S.T. Vuong. Random forest classification for detecting android malware. In Green
Computing and Communications (GreenCom), 2013 IEEE and Internet of Things (iThings/CPSCom),
IEEE International Conference on and IEEE Cyber, Physical and Social Computing, pages 663–669,
Aug 2013. doi: 10.1109/GreenCom-iThings-CPSCom.2013.122.
P.P.K. Chan and Wen-Kai Song. Static detection of android malware by using permissions and api
calls. In Machine Learning and Cybernetics (ICMLC), 2014 International Conference on, volume 1,
pages 82–87, July 2014. doi: 10.1109/ICMLC.2014.7009096.
Hsin-Yu Chuang and Sheng-De Wang. Machine learning based hybrid behavior models for android
malware analysis. In Software Quality, Reliability and Security (QRS), 2015 IEEE International
Conference on, pages 201–206, Aug 2015. doi: 10.1109/QRS.2015.37.
M. Fazeen and R. Dantu. Another free app: Does it have the right intentions? In Privacy,
Security and Trust (PST), 2014 Twelfth Annual International Conference on, pages 282–289, July
2014. doi: 10.1109/PST.2014.6890950.
S. Feldman, D. Stadther, and Bing Wang. Manilyzer: Automated android malware detection through
manifest analysis. In Mobile Ad Hoc and Sensor Systems (MASS), 2014 IEEE 11th International
Conference on, pages 767–772, Oct 2014. doi: 10.1109/MASS.2014.65.
M. Ghorbanzadeh, Yang Chen, Zhongmin Ma, T.C. Clancy, and R. McGwier. A neural network approach
to category validation of android applications. In Computing, Networking and Communications
(ICNC), 2013 International Conference on, pages 740–744, Jan 2013. doi:
10.1109/ICCNC.2013.6504180.
W. Glodek and R. Harang. Rapid permissions-based detection and analysis of mobile malware using
random decision forests. In Military Communications Conference, MILCOM 2013 - 2013 IEEE, pages
980–985, Nov 2013. doi: 10.1109/MILCOM.2013.170.
Hyo-Sik Ham and Mi-Jung Choi. Analysis of android malware detection performance using machine
learning classifiers. In ICT Convergence (ICTC), 2013 International Conference on, pages
490–495, Oct 2013. doi: 10.1109/ICTC.2013.6675404.
Wan-Chen Hsieh, Chuan-Chi Wu, and Yung-Wei Kao. A study of android malware detection technology
evolution.






In Security Technology (ICCST), 2015 International Carnahan Conference on, pages 135–140, Sept
2015. doi: 10.1109/CCST.2015.7389671.
I. Ideses and A. Neuberger. Adware detection and privacy control in mobile devices. In
Electrical Electronics Engineers in Israel (IEEEI), 2014 IEEE 28th Convention of, pages 1–5,
Dec 2014. doi: 10.1109/EEEI.2014.7005849.
F. Idrees and M. Rajarajan. Investigating the android intents and permissions for malware
detection. In Wireless and Mobile Computing, Networking and Communications (WiMob), 2014 IEEE
10th International Conference on, pages 354–358, Oct 2014. doi: 10.1109/WiMOB.2014.6962194.
Q. Jerome, K. Allix, R. State, and T. Engel. Using opcode-sequences to detect malicious android
applications. In Communications (ICC), 2014 IEEE International Conference on, pages 914–919,
June 2014. doi: 10.1109/ICC.2014.6883436.
Hwan-Hee Kim and Mi-Jung Choi. Linux kernel-based feature selection for android malware
detection. In Network Operations and Management Symposium (APNOMS), 2014 16th Asia-Pacific,
pages 1–4, Sept 2014. doi: 10.1109/APNOMS.2014.6996540.
H. Kurniawan, Y. Rosmansyah, and B. Dabarsyah. Android anomaly detection system using machine
learning classification. In Electrical Engineering and Informatics (ICEEI), 2015 International
Conference on, pages 288–293, Aug 2015. doi: 10.1109/ICEEI.2015.7352512.
M. Lindorfer, M. Neugschwandtner, and C. Platzer. Marvin: Efficient and comprehensive mobile
app classification through static and dynamic analysis. In Computer Software and Applications
Conference (COMPSAC), 2015 IEEE 39th Annual, volume 2, pages 422–433, July 2015. doi:
10.1109/COMPSAC.2015.103.
Wen Liu. Mutiple classifier system based android malware detection. In Machine Learning and
Cybernetics (ICMLC), 2013 International Conference on, volume 01, pages 57–62, July 2013. doi:
10.1109/ICMLC.2013.6890444.
Xing Liu and Jiqiang Liu. A two-layered permission-based android malware detection scheme. In
Mobile Cloud Computing, Services, and Engineering (MobileCloud), 2014 2nd IEEE International
Conference on, pages 142–148, April 2014. doi: 10.1109/MobileCloud.2014.22.
Yu Lu, Pan Zulie, Liu Jingju, and Shen Yi. Android malware detection technology based on
improved bayesian classification. In Instrumentation, Measurement, Computer, Communication and
Control (IMCCC), 2013 Third International Conference on, pages 1338–1341, Sept 2013. doi:
10.1109/IMCCC.2013.297.
M.Z. Mas’ud, S. Sahib, M.F. Abdollah, S.R. Selamat, and R. Yusof. Analysis of features
selection and machine learning classifier in android malware detection. In Information Science
and Applications (ICISA), 2014 International Conference on, pages 1–5, May 2014. doi:
10.1109/ICISA.2014.6847364.
A. Munoz, I. Martin, A. Guzman, and J.A. Hernandez. Android malware detection from google play
meta-data: Selection of important features. In Communications and Network Security (CNS), 2015
IEEE Conference on, pages 701–702, Sept 2015. doi: 10.1109/CNS.2015.7346893.
D.V. Ng and J.-I.G. Hwang. Android malware detection using the dendritic cell algorithm. In
Machine Learning and Cybernetics (ICMLC), 2014 International Conference on, volume 1, pages
257–262, July 2014. doi: 10.1109/ICMLC.2014.7009126.
U. Pehlivan, N. Baltaci, C. Acarturk, and N. Baykal. The analysis of feature selection methods
and classification algorithms in permission based android malware detection. In Computational
Intelligence in Cyber Security (CICS), 2014 IEEE Symposium on, pages 1–8, Dec 2014. doi:
10.1109/CICYBS.2014.7013371.
N. Peiravian and Xingquan Zhu. Machine learning for android malware detection using permission
and api calls. In Tools with Artificial Intelligence (ICTAI), 2013 IEEE 25th International
Conference on, pages 300–305, Nov 2013. doi: 10.1109/ICTAI.2013.53.
R. Raveendranath, V. Rajamani, A.J. Babu, and S.K. Datta. Android malware attacks and
countermeasures: Current and future directions. In Control, Instrumentation, Communication and
Computational Technologies (ICCICCT), 2014 International Conference on, pages 137–143, July
2014. doi: 10.1109/ICCICCT.2014.6992944.
RiskIQ. Android malware attacks and countermeasures: Current and future directions. June 2014.
J. Sahs and L. Khan. A machine learning approach to android malware detection. In Intelligence
and Security Informatics Conference (EISIC), 2012 European, pages 141–147, Aug 2012. doi:
10.1109/EISIC.2012.34.
A.A.A. Samra, Kangbin Yim, and O.A. Ghanem. Analysis of clustering technique in android malware
detection. In Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS), 2013
Seventh International Conference on, pages 729–733, July 2013. doi: 10.1109/IMIS.2013.111.
B. Sanz, I. Santos, C. Laorden, X. Ugarte-Pedrero, and P.G. Bringas. On the automatic
categorisation of android applications. In Consumer Communications and Networking Conference
(CCNC), 2012 IEEE, pages 149–153, Jan 2012. doi: 10.1109/CCNC.2012.6181075.
A. Shabtai. Malware detection on mobile devices. In Mobile Data Management (MDM), 2010 Eleventh
International Conference on, pages 289–290, May 2010. doi: 10.1109/MDM.2010.28.
A. Shabtai, Y. Fledel, and Y. Elovici. Automated static code analysis for classifying android
applications using machine learning. In Computational Intelligence and Security (CIS), 2010
International Conference on, pages 329–333, Dec 2010. doi: 10.1109/CIS.2010.77.
L. Tenenboim-Chekina, O. Barad, A. Shabtai, D. Mimran, L. Rokach, B. Shapira, and Y. Elovici.
Detecting application update attack on mobile devices through network featur. In Computer
Communications Workshops (INFOCOM WKSHPS), 2013 IEEE Conference on, pages 91–92, April 2013.
doi: 10.1109/INFCOMW.2013.6970755.






Te-En Wei, Ching-Hao Mao, A.B. Jeng, Hahn-Ming Lee, Horng-Tzer Wang, and Dong-Jie Wu. Android
malware detection via a latent network behavior analysis. In Trust, Security and Privacy in
Computing and Communications (TrustCom), 2012 IEEE 11th International Conference on, pages
1251–1258, June 2012. doi: 10.1109/TrustCom.2012.91.
Yu Wei, Hanlin Zhang, Linqiang Ge, and R. Hardy. On behavior-based detection of malware on
android platform. In Global Communications Conference (GLOBECOM), 2013 IEEE, pages 814–819, Dec
2013. doi: 10.1109/GLOCOM.2013.6831173.
Westyarian, Y. Rosmansyah, and B. Dabarsyah. Malware detection on android smartphones using api
class and machine learning. In Electrical Engineering and Informatics (ICEEI), 2015
International Conference on, pages 294–297, Aug 2015. doi: 10.1109/ICEEI.2015.7352513.
Zhao Xiaoyan, Fang Juan, and Wang Xiujuan. Android malware detection based on permissions. In
Information and Communications Technologies (ICT 2014), 2014 International Conference on, pages
1–5, May 2014. doi: 10.1049/cp.2014.0605.
J. Xu, Y. Yu, Z. Chen, B. Cao, W. Dong, Y. Guo, and J. Cao. Mobsafe: cloud computing based
forensic analysis for massive mobile applications using data mining. Tsinghua Science and
Technology, 18(4):418–427, August 2013. doi: 10.1109/TST.2013.6574680.
S.Y. Yerima, S. Sezer, G. McWilliams, and I. Muttik. A new android malware detection approach
using bayesian classification. In Advanced Information Networking and Applications (AINA), 2013
IEEE 27th International Conference on, pages 121–128, March 2013. doi: 10.1109/AINA.2013.88.
S.Y. Yerima, S. Sezer, and G. McWilliams. Analysis of bayesian classification-based approaches
for android malware detection. Information Security, IET, 8(1):25–36, Jan 2014a. ISSN
1751-8709. doi: 10.1049/iet-ifs.2013.0095.
S.Y. Yerima, S. Sezer, and I. Muttik. Android malware detection using parallel machine learning
classifiers. In Next Generation Mobile Apps, Services and Technologies (NGMAST), 2014 Eighth
International Conference on, pages 37–42, Sept 2014b. doi: 10.1109/NGMAST.2014.23.



