
Fraudulent Behaviour Identification in Ethereum Blockchain

Karolis Lašas

Gabrielė Kasputytė

Rūta Užupytė

Tomas Krilavičius

Baltic Institute of Advanced Technology, Department of Applied Informatics, Vytautas Magnus University, Kaunas, Lithuania
Baltic Institute of Advanced Technology, Department of Mathematics and Statistics, Vytautas Magnus University, Kaunas, Lithuania


The phenomenon of cryptocurrencies continues to draw a lot of attention from investors, innovators and the general public. There are over 1300 different cryptocurrencies, including Bitcoin, Ethereum and Litecoin. While the scope of blockchain technology and cryptocurrencies continues to increase, identification of unethical and fraudulent behaviour remains an open issue. The absence of regulation of the cryptocurrency ecosystem and the lack of transparency of transactions may lead to an increased number of fraudulent cases. In this research, we analyzed the possibility of identifying fraudulent behaviour using different classification techniques. Based on Ethereum transactional data, we constructed a transaction network which was analyzed using a graph traversal algorithm. Wallets were then grouped and classified using three machine learning algorithms: k-means clustering, Support Vector Machine and random forest classifier. The performance of the models was evaluated using several accuracy metrics that can be calculated from a confusion matrix. The results revealed that the best performance was achieved using the random forest classification model.

Keywords: Cryptocurrency, Ethereum, Blockchain, Fraudulent Activity, K-Means Clustering, Support Vector Machine, Random Forest Classifier

1. Introduction

Cryptocurrencies are a viable alternative to traditional mediums of exchange for purchasing goods or services. The main idea behind this type of currency is that the exchange between two parties can occur without the involvement of a central authority. It is the network itself that manages and confirms each transaction. The overall history of transactions is maintained using blockchain technology, which can be described as a growing list of records that are linked together using cryptography. Each block contains a cryptographic hash of the previous block, a timestamp and transaction data. Even though blockchain technology records information about each transaction, it also preserves personal anonymity, as long as there is no link between the wallet and its owner's identity. For this reason, cryptocurrencies are more frequently used for fraudulent activities [1]. According to the blockchain forensics company CipherTrace [2], the increasing number of scams led to 4.5 billion dollars in losses in 2019. According to the blockchain monitoring company Chainalysis [3], the Ethereum blockchain, which is the second largest cryptocurrency by market capitalization, is the top choice for fraudulent activity. The aim of this paper is to analyze the possibility of using machine learning techniques to identify wallets engaging in fraudulent activities in the Ethereum blockchain.

The rest of the paper is organized as follows. Related work in this area is presented in Section 2. Section 3 introduces the dataset used in the current study and the performed preprocessing steps. Section 4 presents the selected clustering and classification techniques. Section 4.4 describes the accuracy metrics that were used to evaluate computational results. Experimental results are provided in Section 5. Finally, concluding remarks and future plans are discussed in Section 6.

IVUS 2020: Information Society and University Studies, 23 April 2020, KTU Santaka Valley, Kaunas, Lithuania
karolis.lasas@bpti.lt (K. Lašas); gabriele.kasputyte@bpti.lt (G. Kasputytė); ruta.uzupyte@bpti.lt (R. Užupytė); tomas.krilavicius@bpti.lt (T. Krilavičius)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

2. Related Work

Fraudulent activity identification in cryptocurrency is discussed in [4]. The article aims to develop a novel approach based on supervised machine learning to de-anonymize the Bitcoin ecosystem and identify criminal activities in the Bitcoin blockchain. A substantial number of Bitcoin addresses had already been identified, clustered and categorized by the data provider. However, most of the clusters were uncategorized. Overall, the dataset contains around 395 million transactions related to 957 unique clusters. The 957 observations which were labeled by the data provider were used as training and test sets. This dataset includes categories commonly associated with illegal activities, including darknet market, mixing, ransomware, scam, stolen bitcoins, and gambling from the perspective of certain jurisdictions. The research method consisted of three iterations using three separate datasets: the initial dataset, the dataset with over-sampled minority classes, and the final dataset, where all classes were over-sampled to match the number of observations of the most populated class.

Upon comparing the results of the three iterations, the over-sampled datasets were discarded. Moreover, the performance of seven algorithms (Decision Trees, Bagging, Random Forests, Extra Trees, AdaBoost, Gradient Boosting and k-Nearest Neighbors) was compared and the best four (Gradient Boosting, Random Forests, Extra Trees and Bagging Classifier) were chosen. Finally, Gradient Boosting was selected as the most accurate algorithm with an average cross-validation accuracy of 80.83%.

Anomaly detection in the Bitcoin network was analysed in [5], where three unsupervised learning methods were applied: k-means clustering, the Mahalanobis distance method and unsupervised Support Vector Machines. In this research the Bitcoin transaction network was transformed into two graphs: one with nodes as users and one with nodes as transactions. The dataset consists of more than 6 million unlabeled users with more than 37 million transactions and 30 revealed thieves in the Bitcoin network. However, due to the long run-time, the dataset was limited to 100,000. Both the unsupervised SVM method and the Mahalanobis distance based method suggested similar suspicious users. In this case two cases of theft and one case of loss out of the 30 known cases were detected.

The use of machine learning techniques for the identification of abnormal activities in the Ethereum network is discussed in [6]. In this case, decision tree classifier, k-nearest neighbors, Random Forest, Support Vector Machine (SVM), Multi-layer perceptron (MLP) and Naive Bayes algorithms were compared. Using a dataset consisting of 169,192,702 Ethereum transactions, two evaluation models were analysed:

1. testing on 50 originally marked malicious addresses;
2. testing on 50 randomly selected malicious addresses out of a possible 3830, under the assumption that addresses are marked as malicious if they have an outgoing transaction with an address already marked as malicious.

Malicious addresses are considered to be the ones which perform unauthorized or illegal actions, such as issuing fake tokens, posing as a fake admin in ICOs (Initial Coin Offerings), scambot phishers, slackbot, a fake etherscan site, a fake site asking for private keys or a fake crowdsale site. 125 addresses were identified as malicious and later were split into 75 for training and 50 for testing as ground truth. Under the previously mentioned assumption, 3830 addresses were marked as malicious. The best results were achieved using the second evaluation model, where the SVM, Decision Tree and Random Forest classifiers produced the same accuracy of 99.66%. Moreover, 5-fold cross-validation was used to prevent the models from over-fitting.

A comprehensive identification model for detection of phishing scams in Ethereum is discussed in [7]. In this work, a large-scale Ethereum transaction network was built. Additionally, a novel network-embedding algorithm called trans2vec, with biases of transaction amount and timestamp, was designed to extract features from the Ethereum transaction network. Moreover, on account of data imbalance and network heterogeneity, a one-class SVM was adopted to classify the phishing and non-phishing addresses. Finally, the article concluded that, on real Ethereum transaction data, the proposed detection framework is effective and trans2vec is superior to baseline methods in terms of feature extraction.

To sum up, some of these articles report high accuracy of fraudulent behavior identification, while others report low accuracy. One of the articles found that a new algorithm gives better results than the baseline methods. The differences between the results obtained with the same models are caused by the different types of data, their size and the information they contain. In order to analyze the accuracy on our own data, we decided to use three popular and commonly used methods: K-Means clustering, Support Vector Machine and Random Forest classifier.

3. Data preprocessing and features' extraction

3.1. Initial data

The data set consists of two collections of Ethereum transactions. The first collection is composed of about 420 fraudulent wallets identified from the etherscamdb.info database. Detailed information about their transactions was gathered from etherscan.io. The second data collection represents non-fraudulent activities and consists of 53 wallets and their transactional information gathered from the etherscan.io database. Each data set includes:

• transaction hash code;
• sender's address;
• receiver's address;
• transaction value;
• time at which the transaction was made;
• Ethereum block number.
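As an illustration only, a transaction record with the six fields listed above, and a per-wallet summary derived from it, might be sketched as follows. The class, function and field names, the timestamp unit (seconds) and the toy transactions are assumptions made for this example; they are not the authors' implementation, and the population standard deviation is just one possible choice.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Transaction:
    """One Ethereum transfer with the six fields listed above."""
    tx_hash: str
    sender: str
    receiver: str
    value_eth: float
    timestamp: int      # seconds; the unit is an assumption
    block_number: int


def build_wallet_graph(transactions):
    """Wallets are nodes; each transfer is a directed edge sender -> receiver."""
    outgoing = defaultdict(list)
    incoming = defaultdict(list)
    for tx in transactions:
        outgoing[tx.sender].append(tx)
        incoming[tx.receiver].append(tx)
    return outgoing, incoming


def wallet_features(wallet, outgoing, incoming):
    """Aggregate simple behaviour statistics for one wallet."""
    sent = sorted(outgoing.get(wallet, []), key=lambda tx: tx.timestamp)
    received = incoming.get(wallet, [])
    # Time gaps between consecutive outgoing transactions.
    gaps = [b.timestamp - a.timestamp for a, b in zip(sent, sent[1:])]
    return {
        "total_sent_eth": sum(tx.value_eth for tx in sent),
        "total_received_eth": sum(tx.value_eth for tx in received),
        "n_sent": len(sent),
        "n_received": len(received),
        "avg_gap_sent_s": mean(gaps) if gaps else None,
        "std_gap_sent_s": pstdev(gaps) if len(gaps) > 1 else None,
    }


# Hypothetical toy data, not taken from the paper's data set.
txs = [
    Transaction("0xh1", "0xA", "0xB", 1.5, 1_000, 10),
    Transaction("0xh2", "0xA", "0xC", 0.5, 1_600, 12),
    Transaction("0xh3", "0xB", "0xA", 2.0, 2_000, 15),
]
outgoing, incoming = build_wallet_graph(txs)
features = wallet_features("0xA", outgoing, incoming)
print(features)
```

Keeping separate outgoing and incoming adjacency lists makes sender-side and receiver-side statistics equally cheap to compute for every wallet.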

3.2. Features extraction

Transactional data was transformed into a graph, where the nodes represent wallets and edges indicate money transfers. Using a graph traversal algorithm, we identify parameters representing each wallet's behaviour:

• total value in ETH sent by a wallet;
• total received value in ETH by a wallet;
• a number of transactions sent by a wallet;
• a number of transactions received by a wallet over a time period;
• average time between transactions performed by a sending wallet;
• average time between transactions to a receiving wallet;
• standard deviation of time between transactions performed by a sending wallet;
• standard deviation of time (in seconds) between transactions to a receiving wallet;
• average value in ETH sent by a wallet;
• average value in ETH received by a wallet.

4. Methodology

4.1. K-Means Clustering

The first method that we considered was the k-means clustering algorithm, as its computational times are considerably small compared to other similar clustering techniques. Also, k-means clustering may help to determine underlying patterns of fraudulent and non-fraudulent behaviour by grouping similar wallets' activities. The k-means clustering algorithm works by allocating data points from given input vectors to a predefined number of clusters using a similarity criterion, usually the Euclidean distance:

$$\|x - c_k\|^2,$$

where $x$ is a data point and $c_k$ is the $k$-th cluster's centroid. Each centroid is calculated by averaging given input vectors:

$$c_k = \frac{1}{|S_k|} \sum_{x \in S_k} x,$$

where $S_k$ denotes the set of data points assigned to the $k$-th cluster. The objective of a k-means algorithm is to minimize total intra-cluster variance.

Among the many disadvantages of the k-means clustering algorithm, such as vulnerability to outliers or inability to cluster heavily overlapping data, there is the manual selection of the number of clusters. The algorithm's inability to automatically select an optimal number of clusters in some cases makes it an unreliable solution to data partitioning, as defining a number of clusters for unlabeled data leaves the user with uncertainty, especially when working with large amounts of data. However, there is no need to guess the number of clusters, as there are a few methods that search for an optimal number of clusters. One of them is the elbow method.

It is one of the oldest methods for defining an optimal number of clusters and works by calculating the sum of squared distances between every data point and its closest centroid [8]:

$$\sum_{k=1}^{K} \sum_{x \in S_k} \|x - c_k\|^2,$$

where $K$ is the number of clusters. The optimal number of clusters can be identified by a visible "elbow" on the curve (see fig. 1). The last number before the curve flattens is the optimal count of clusters. The main drawback of this method occurs when there is no visible "elbow" on the curve or more than one "elbow" is visible.
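To make the centroid update and the elbow criterion concrete, a minimal pure-Python sketch is given below. It is not the implementation used in this study: the initialisation (the first k points), the handling of empty clusters and the toy 2-D points are assumptions chosen to keep the example short and deterministic.

```python
def squared_distance(x, c):
    """Squared Euclidean distance ||x - c||^2."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def assign(points, centroids):
    """Put every point into the cluster of its closest centroid."""
    clusters = [[] for _ in centroids]
    for x in points:
        nearest = min(range(len(centroids)),
                      key=lambda j: squared_distance(x, centroids[j]))
        clusters[nearest].append(x)
    return clusters


def k_means(points, k, max_iter=100):
    """Lloyd's algorithm; initial centroids are simply the first k points
    (an assumption made to keep the sketch deterministic)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        # Update step: each centroid becomes the mean of its members;
        # an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


def within_cluster_ss(centroids, clusters):
    """Sum of squared distances between each point and its centroid."""
    return sum(squared_distance(x, c)
               for c, members in zip(centroids, clusters)
               for x in members)


# Hypothetical 2-D wallet features forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
wcss = {k: within_cluster_ss(*k_means(points, k)) for k in (1, 2, 3)}
for k, value in wcss.items():
    print(k, round(value, 2))
```

On this toy input the sum of squared distances falls sharply between one and two clusters and does not improve further with a third, which is the "elbow" pattern described above.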

4.2. Support Vector Machine

In order to find an optimal boundary between wallets with fraudulent and non-fraudulent behaviour, Support Vector Machine (SVM) is used. It offers high accuracy and requires less computational power than other machine learning algorithms. SVM aims to find a hyperplane in a $p$-dimensional space (where $p$ is the number of factors used as input for the model) that separates given data points into classes. SVM can be used both for regression and classification problems [9, 10]. Consider a data set consisting of $m$ pairs of records $(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)$ as a training set, where $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$ [11]. In order to classify these pairs, we define a hyperplane that will separate them:

$$\{x : f(x) = x^T \beta + \beta_0 = 0\},$$

where $\beta$ is a unit vector ($\|\beta\| = 1$). Using the defined hyperplane $f(x)$, a rule for data classification can be written as

$$G(x) = \operatorname{sign}[x^T \beta + \beta_0].$$

For nonlinear SVM classification, the kernel method is used. The kernel method generates algorithms that map given input data into a high-dimensional feature space. Popular kernel functions used in this method are:

• Polynomial: $K(x, x') = (x \cdot x' + 1)^d$, where $d$ is the degree of the polynomial;
• Gaussian radial basis function (RBF): $K(x, x') = \exp\{-\gamma \|x - x'\|^2\}$, where $\gamma > 0$;
• Sigmoid: $K(x, x') = \tanh(\kappa_1\, x \cdot x' + \kappa_2)$.
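The linear decision rule and the three kernel functions can be written down directly, as in the hedged sketch below. The function and parameter names (`degree`, `gamma`, `kappa1`, `kappa2`) are illustrative choices, and mapping a value of exactly zero to class 1 is a convention assumed here rather than something stated in the text.

```python
import math


def dot(x, z):
    """Inner product of two equally long vectors."""
    return sum(xi * zi for xi, zi in zip(x, z))


def classify(x, beta, beta0):
    """Linear rule sign(x^T beta + beta0); zero is mapped to 1 by assumption."""
    return 1 if dot(x, beta) + beta0 >= 0 else -1


def polynomial_kernel(x, z, degree):
    """(x . z + 1)^degree"""
    return (dot(x, z) + 1) ** degree


def rbf_kernel(x, z, gamma):
    """exp(-gamma * ||x - z||^2) with gamma > 0"""
    squared_norm = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * squared_norm)


def sigmoid_kernel(x, z, kappa1, kappa2):
    """tanh(kappa1 * x . z + kappa2)"""
    return math.tanh(kappa1 * dot(x, z) + kappa2)


# Hypothetical feature vectors used only to exercise the functions.
a, b = (1.0, 2.0), (3.0, 4.0)
print(polynomial_kernel(a, b, degree=2))
print(rbf_kernel(a, b, gamma=0.5))
print(sigmoid_kernel(a, b, kappa1=0.01, kappa2=0.0))
print(classify((2.0, 0.0), beta=(1.0, 0.0), beta0=-1.0))
```

The RBF kernel equals 1 for identical inputs and decays towards 0 as the points move apart, which is why it can act as a similarity measure between wallets.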

4.3. Random Forest Classifier

Random Forest is a supervised machine learning algorithm that can be used to solve classification or regression problems and is more flexible with input data than SVM, especially working with large amounts of data.

It is a decision tree-based algorithm that randomly selects various data samples and, by calculating predictions for every tree, makes decisions from which it partitions input data into new subsets. It uses averaging to improve the classification accuracy and controls the model to avoid over-fitting. For a $p$-dimensional input $X = (X_1, X_2, \dots, X_p)$ the goal of a random forest is to find a prediction function $f(X)$ for predicting a response variable $Y$. The prediction function minimizes the expected value of the loss, measured by a loss function $L(Y, f(X))$ that usually is the zero-one loss [12]:

$$L(Y, f(X)) = \begin{cases} 0, & \text{if } Y = f(X), \\ 1, & \text{otherwise.} \end{cases}$$
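The zero-one loss and the way individual tree predictions are combined can be illustrated with a small sketch. The three "trees" below are hand-written threshold rules on two hypothetical wallet features, not trees learned from the data of this study, and majority voting is assumed as the combination rule for classification.

```python
from collections import Counter


def zero_one_loss(y_true, y_pred):
    """0 if the prediction matches the true label, 1 otherwise."""
    return 0 if y_true == y_pred else 1


def majority_vote(votes):
    """Return the class predicted by most trees."""
    return Counter(votes).most_common(1)[0][0]


# Hypothetical stand-ins for fitted trees: each maps a feature vector
# (total ETH sent, average seconds between transactions) to a class label,
# with 1 meaning fraudulent and 0 meaning non-fraudulent.
trees = [
    lambda x: 1 if x[0] > 5 else 0,
    lambda x: 1 if x[1] < 60 else 0,
    lambda x: 1 if x[0] > 1.5 else 0,
]

samples = [(10, 30), (1, 600), (8, 500), (2, 20)]
labels = [1, 0, 1, 0]

forest_predictions = [majority_vote([tree(x) for tree in trees]) for x in samples]
# Average zero-one loss over the samples, i.e. the misclassification rate.
risk = sum(zero_one_loss(y, p) for y, p in zip(labels, forest_predictions)) / len(labels)
print(forest_predictions, risk)
```

Using an odd number of trees in a two-class problem avoids ties in the vote.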

4.4. Accuracy evaluation

To estimate the accuracy of the proposed models, we use a few commonly used metrics [13, 14, 15, 16] that can be calculated from a confusion matrix, also known as a contingency table (see table 1):

• True Positive Rate: $TPR = \frac{TP}{TP + FN}$. TPR, also known as sensitivity or recall, shows the amount of successfully predicted class values compared to all class values in a data set.
• True Negative Rate (Selectivity): $TNR = \frac{TN}{TN + FP}$. TNR, also known as selectivity, is the amount of successfully predicted values for the other class.
• Precision (Positive Predicted Value): $PPV = \frac{TP}{TP + FP}$.
• F1-measure: $F_1 = 2 \cdot \frac{PPV \cdot TPR}{PPV + TPR}$. The F1-measure is a harmonic mean of recall and precision [17] and refers to classification accuracy.

Here $TP$ is true positive (successfully predicted first class values), $TN$ is true negative (successfully predicted second class values), $FP$ is false positive (faulty predicted second class values, also referred to as type I error) and $FN$ is false negative (faulty predicted first class values, also referred to as type II error).

Table 3
Accuracy for different types of kernel

Kernel        F1-measure (%)
Polynomial    92
Sigmoid       89
GRB           93
Linear        89

5. Results

5.1. K-Means Clustering

In this case, we decided to cluster the data into two groups referring to fraudulent and non-fraudulent wallets. We also applied the elbow method to identify the optimal number of clusters (fig. 1), which confirmed that two clusters are an optimal choice. Using the actual data labels, we evaluated the accuracy of the k-means algorithm. The results revealed that overall clustering accuracy reaches 87% (see table 2). However, while fraudulent wallets were clustered with 93% accuracy, all non-fraudulent wallets were labeled as frauds (table 2). A more detailed study of clustering results was carried out using graphical analysis. For example, figure 2 represents the relationship between the average value in ETH sent by a wallet and the average time between outgoing transactions. Different colours represent separate clusters. By comparing clustering results with the labelled dataset (fig. 3), we can see that the algorithm identifies the most extreme cases (cases with the largest values). However, the model is unable to separate the rest of the data. Based on these results, we can conclude that k-means clustering provides unreliable results.
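Comparing predicted labels with the actual ones, as done above for the clusters, amounts to counting the four confusion-matrix cells and applying the metrics of subsection 4.4. The sketch below shows this under stated assumptions: the label vectors are invented for illustration (they are not the values behind table 2), class 1 stands for fraudulent wallets, and a metric with a zero denominator is returned as 0.0 by convention.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a two-class problem."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def safe_ratio(numerator, denominator):
    """Return 0.0 when the denominator is zero (a convention assumed here)."""
    return numerator / denominator if denominator else 0.0


def metrics(tp, tn, fp, fn):
    tpr = safe_ratio(tp, tp + fn)              # sensitivity / recall
    tnr = safe_ratio(tn, tn + fp)              # selectivity
    ppv = safe_ratio(tp, tp + fp)              # precision
    f1 = safe_ratio(2 * ppv * tpr, ppv + tpr)  # harmonic mean of PPV and TPR
    return {"TPR": tpr, "TNR": tnr, "PPV": ppv, "F1": f1}


# Hypothetical labels: 8 fraudulent (1) and 2 non-fraudulent (0) wallets,
# with every non-fraudulent wallet predicted as fraudulent.
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 7 + [0] + [1, 1]

result = metrics(*confusion_counts(y_true, y_pred))
print(result)
```

In this invented case the true negative rate is zero even though recall is high, which mirrors the situation where a model looks accurate overall while failing on the minority class.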

5.2. Support Vector Classifier

In order to achieve the best classification result, we performed experiments using four support vector machine classification models:

• linear SVM;
• SVM with polynomial kernel;
• SVM with sigmoid kernel;
• SVM with Gaussian Radial Basis (GRB) kernel.

The labeled data set was split into training (80 percent of data) and testing (20 percent of data) sets. The highest accuracy (93%) was achieved by using the nonlinear SVM model with Gaussian Radial Basis (GRB) kernel (table 4). However, although the nonlinear SVM with GRB kernel classified 96% of fraudulent wallets correctly, 54% of non-fraudulent wallets were classified as frauds.

5.3. Random Forest Classifier

After performing classification with RFC with 90 trees, we extracted feature importances for model fine tuning (fig. 4). Parameters with an importance level higher than 0.1 were selected as the most important:

• total sent value in ETH;
• the average value in ETH sent by a wallet;
• average time between outgoing transactions;
• standard deviation of time between outgoing transactions;
• frequency of outgoing transactions.

After defining the list of parameters that have the highest influence on classification results, the random forest classification algorithm was applied. To evaluate the model's accuracy we used the accuracy metrics discussed in subsection 4.4. The RFC model reaches 95% accuracy (see table 5). This method predicts fraudulent wallets with 97% accuracy and non-fraudulent wallets with 67% accuracy.

6. Conclusions

In this research, we investigated three machine learning techniques to identify fraudulent behaviour in the Ethereum blockchain data set. First of all, we suggested a data preprocessing framework for the extraction of individual behaviour patterns from a transactional dataset. Based on these patterns, the proposed models were trained and compared according to selected accuracy measures. Experimental results revealed that the random forest classification method is the most suitable for the identification of fraudulent behaviour. Furthermore, the model suggests that the most important factors for fraudulent behaviour identification are the total value in ETH sent by a wallet, the average value in ETH sent by a wallet, the average time between outgoing transactions, the standard deviation of time between outgoing transactions and the frequency of outgoing transactions.

In the future, we are planning to improve the proposed model's reliability by increasing the number of both fraudulent and non-fraudulent wallets. Moreover, we are planning to analyse the possibility of using the XGBoost method, as it was suggested for the identification of abnormal activity in blockchain data [18]. Furthermore, we are planning to perform a statistical significance test in order to find out whether the differences between the results are statistically significant.

7. Acknowledgments

We thank Tadas Tamošiūnas, Pavel Sokolov and UAB Kevin EU for cooperation and useful insights.

References

[1] S. C. Baum, Cryptocurrency fraud: A look into the frontier of fraud, 2018.

[2] Ciphertrace, 2020. URL: https://ciphertrace.com.

[3] Chainalysis, 2020. URL: https://www.chainalysis.com/.

[4] H. H. Sun Yin, Langenheldt, Harlev, R. R. Mukkamala, Vatrapu, Regulating cryptocurrencies: a supervised machine learning approach to de-anonymizing the bitcoin blockchain, Journal of Management Information Systems 36 (2019) 37-73.

[5] Pham, Lee, Anomaly detection in bitcoin network using unsupervised learning methods, arXiv preprint arXiv:1611.03941 (2016).

[6] Sing, Anomaly Detection in the Ethereum Network, Ph.D. thesis, Indian Institute of Technology Kanpur, 2019.

[7] Wu, Yuan, Lin, You, Chen, Zheng, Who are the phishers? phishing scam detection on ethereum via network embedding, arXiv preprint arXiv:1911.09259 (2019).

[8] T. M. Kodinariya, P. R. Makwana, Review on determining number of cluster in k-means clustering, International Journal 1 (2013) 90-95.

[9] Awad, Khanna, Support vector machines for classification, in: Efficient Learning Machines, Springer, 2015, pp. 39-66.

[10] Beritelli, G. Capizzi, G. Lo Sciuto, Napoli, Scaglione, Rainfall estimation based on the intensity of the received signal in a lte/4g mobile terminal by using a probabilistic neural network, IEEE Access 6 (2018) 30865-30873. doi:10.1109/ACCESS.2018.2839699.

[11] Hastie, Tibshirani, J. Friedman, The elements of statistical learning: data mining, inference, and prediction, Springer Science & Business Media, 2009.

[12] Cutler, D. R. Cutler, J. R. Stevens, Random forests, in: Ensemble machine learning, Springer, 2012, pp. 157-175.

[13] Fawcett, An introduction to roc analysis, Pattern recognition letters 27 (2006) 861-874.

[14] D. M. Powers, Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation (2011).

[15] Capizzi, G. Lo Sciuto, Napoli, Polap, Woźniak, Small lung nodules detection based on fuzzy-logic and probabilistic neural network with bio-inspired reinforcement learning, IEEE Transactions on Fuzzy Systems 6 (2020).

[16] Beritelli, G. Capizzi, G. Lo Sciuto, Napoli, Woźniak, A novel training method to preserve generalization of rbpnn classifiers applied to ecg signals diagnosis, Neural Networks 108 (2018) 331-338.

[17] Z. C. Lipton, Elkan, Narayanaswamy, Optimal thresholding of classifiers to maximize f1 measure, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2014, pp. 225-239.

[18] Ostapowicz, Żbikowski, Detecting fraudulent accounts on blockchain: A supervised approach, in: International Conference on Web Information Systems Engineering, Springer, 2019, pp. 18-31.