Identification of Unusual Wallets on Ethereum Platform* Mikhail Petrov National Research University Higher School of Economics Myasnitskaya 20, Moscow, Russia mishanayrus@gmail.com Abstract. In this work we introduce a suspiciousness rating of Ethereum wallets. The rating is based on different characteristics of the wallets and the transactions they were involved in during a week. To achieve this goal atypical vertices of the transactions graph for the Ether cryptocurrency need to be discovered, so, first, we identify typical groups of nodes. Then the nodes which are far different from all these groups are considered to be suspicious. Keywords: Ethereum platform · cryptocurrency transactions · trans- actions graph. 1 Introduction Anomaly detection is an area that has been receiving much attention in recent years. It has a wide variety of applications, including fraud detection, network intrusion detection, medical diagnosis and other fields. Usually research in this area is using attribute-value data as the medium from which anomalies are to be extracted. Some works are focused on anomaly detection in graph-based data. In our paper we are going to combine these approaches for the goal of identification of suspicios wallets in a cryptocurrency community. 1 1.1 Cryptocurrencies and Ether The area of cryptocurrency is quite young. The first appearance of Ether, the transactions graph of which will be analyzed in this article, appeared in 2013. By now the analysis of the Ether community has progressed very little. One can find quite many articles on the analysis and prediction of cryptocurrencies rates, but so far few research has been focused on the analysis of exchange communities for the cryptocurrencies. It is known that the semantics of transactions of blockchain systems can be captured by a transaction graph, see e.g. [1]. Such a graph generally consists of * Copyright c 2019 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0) 2 M. Petrov the states and the transactions as transitions between the states, together with conditions for the consistency and validity of transactions. Detailed overview of cryptocurrencies could be found in [4]. 1.2 Overview of anomaly detection methods In [8] a comprehensive survey of recent anomaly detection systems and hybrid in- trusion detection systems is provided, and recent technological trends in anomaly detection are also discussed. In [2] existing techniques are grouped into different categories based on the underlying approach adopted by each technique. For each category key assumptions are identified, which are used by the techniques to differentiate between normal and anomalous behavior. In [7] two techniques for graph-based anomaly detection are introduced. The authors suggest a new method for calculating the regularity of a graph, with applications to anomaly detection. Experimental results are provided which use both real-world network intrusion data and artificially-created data. In [5] several information-theoretic measures, namely, entropy, conditional entropy, relative conditional entropy, in- formation gain, and information cost for anomaly detection are used. 1.3 Networks analysis As it was already mentioned, see [7], some anomaly detection method use graph- based analysis. Besides standard graph analysis library networkx, we are also using statistical properties of social networks, see [3, 6]. 1.4 Structure of the remaining text The rest of the paper is organized as follows. Section 2 describes the data we worked with. Section 3 describes the methods we used to analyze transaction data. Section 4 presents a description of how the points for the strangeness rating were calculated. Section 5 shows the top 5 vertices as a result of calculating the suspiciousness rating with the rating values. 2 The data analyzed In this paper data on all transactions for a week was downloaded using the open etherium API. Significant characteristics for the community analysis were cho- sen, and an ether exchange graph was constructed based on them. The vertices of this graph are the participants in the platform and also the edges are the transactions between these participants. Further in the received graph, the char- acteristics of the vertices were calculated, clusters were found, and associative rules based on them were constructed. As a result, taking into account the data on graph vertices, clustering, and associative rules, the suspiciousness rating of vertices was constructed. Identification of Unusual Wallets on Ethereum Platform 3 Since in this paper we were focused on identifying suspicious and untypical members in the ether exchange community, only those characteristics that were useful for analyzing and constructing the graph were extracted from the set of parameters. 2.1 Ethereum API Ethereum API provides information about each block, all transactions and every members of the network. JSON RPC API of the Ethereum platform currently supports four programming languages: C++, Go, Python, and Parity. Also there is an option to access the data via web interface at https://etherscan.io, which we used. For our study all transactions executed during a particular week have been downloaded. That was done using the timestamp field of the blocks and the method eth getBlockByNumber(). Totally information about 3,382,252 transactions were collected. 2.2 Format of Ethereum transaction data All transactions were selected from each block and then the following four pa- rameters were saved for each transaction: – address of the sender; – address of the receiver; – date and time of the transaction; – the amount of the internal currency (wei) that is transferred. These characteristics of transactions allow us to construct the graph of an ether exchange. The vertices in this graph are the addresses of wallets which are either senders or receivers of these transactions, the edges correspond to the transac- tions themselves. In this paper we analyze only total amount sent from node A to node B and ignore details about the number of transactions and distribution of the amount between them. In addition to the total volume transmitted between a pair of wallets, the frequency of interactions between participants (the number of transactions sent) could be considered too. 2.3 Transactions graph The resulting set of transaction on ether exchange was processed and presented as a graph. The first step to getting the graph was renaming vertices, since addresses do not make any sense. In order to save memory and more clarity, they were numbered with the help of natural numbers, and the dictionary with addresses mapping in the vertex numbers was saved in the file. Next, we summed the weights of the edges between identical pairs of sender and receiver, recorded the graph as an adjacency list with edge weights and 4 M. Petrov the total number of transactions between vertices. As a result, we obtained an undirected graph with positive and negative edge weights. The graph has 1,577,010 vertices and 4,963,980 edges. To analyze this graph we used networkx, a common library of the Python language. All further operations for calculating the various characteristics of the graph have been done with the help of this library. 3 Analysis Our anomaly detection analysis consists of the following four parts: 1. Graph connectivity analysis. During this step we put off wallets for which too few information is available, so no meaningful analysis could be conducted for them. 2. Analysis of the vertex characteristics. Here we compute degree, cen- trality and containing k-core for each vertex of the transaction graph. 3. Cluster analysis. The clustering is performed based on the vectors obtained in the previous section. 4. Association rules are used to find patterns that can be traced in this graph. After each part all the vertices are ranged according to their suspiciousness, and then these ratings are merged into one. 3.1 Connectivity analysis The next step was to analyze the connectivity components of the obtained graph. The graph has 35,628 connectivity components. These components can be di- vided into three groups: – The main component of a large community. In this component there are 1,474,024 vertices. Most likely, it is a graph of the interaction of people of some large service or exchange, but the exact meaning is not yet clear. This component has the biggest interest for analysis and will continue to be the first in priority. – The second type is a small groups of ten to a thousand people. Most likely, these groups are small companies in which payment occurs in the cryptocur- rency, or small communities that pay cryptocurrency for some services. This group also deserves attention for analysis, but to a lesser extent. – The third class includes groups of up to ten people. These are some local exchanges of money, betting between people. This group has no interest for analysis, since it does not contain any meaningful information. As was mentioned above, the first group has the greatest interest for ana- lyzing, therefore it will be analyzed in the future. The remaining connectivity components are stored in separate files and their analysis is possible in future work. Identification of Unusual Wallets on Ethereum Platform 5 3.2 Vertex Characteristics Since the main purpose of the work is to distinguish atypical and suspicious vertices, then we must determine the typical aspects of the obtained community. Taking into account that the received connectivity component can have multiple chains of elements with a degree equal to two, therefore the graph has many vertices that are just a link in the chain of transmission of the cryptocurrency, it was decided to allocate the k-cores of the graph of this component. A k-core of a graph G is a maximal subgraph of G in which all vertices have degree at least k. To do this, we run a search of the value of k-cores from 1 to 50 and used the method k core for each value. If the k-core was not found for the value of n, then it will not be found for n + 1, since any k+1-core is also a k-core. Therefore, if such a situation arises in searching the k-cores, then the search can be stopped. The algorithm stopped at k equal to 15, which means that the k-kernels from 1 to 14 were found. Since the sizes of k-cores beginning with k equal to 8 are already sufficiently small and the 3-core is quite large, 4, 5, 6, and 7 core were chosen as an optimal from the point of view of time for analyzing the graph. Based on the selected k-cores we can calculate new characteristics of the vertices. Three kinds of centrality were calculated for each vertex in the core: betweenness centrality, closeness centrality and degree centrality. Also we add vertex degree as a characteristic. On the basis of the obtained data, we can form a vector that will describe the vertices of the graph (participants of the cryptocurrency community), split each characteristic into groups of segments and create binary vectors for deriving associative rules. As a result, we obtained a 5-dimensional vector characterizing the vertices of the graph: 3 kinds of centralities; number of the k-core, which vertex belongs; degree of a vertex. Also, the dimensionality of these vectors could be increased by adding distance to the hubs for each of the characteristics, but this operation is planned for the next stages of the work. The resulting vectors can be clustered now. 3.3 Clustering Based on the vectors obtained in the previous section, clustering is performed using the standard k-means method. Since for this method it is necessary to initially know the number of clusters we must first determine this number. To determine the optimal number of clusters all numbers from 0 to 200 in steps of 10 were chosen, clustering was performed and the result were checked with Silhouette and the shoulder methods. 6 M. Petrov Table 1. The Silhouette Coefficients Number of clusters Coefficient value 10 20 30 40 0.6768368853142156 0.7095846965311733 0.8111177029982155 0.8702494303581093 50 60 70 80 0.9146516160235698 0.9293358278581296 0.9387072507702181 0.9478746102234425 90 100 110 120 0.9505421030096185 0.950768946865395 0.7874773068611389 0.7884553347797727 130 140 150 160 0.7698049042610146 0.7467290976039868 0.7992449839069017 0.7467532658132512 170 180 190 0.7044007926707637 0.6867901637311503 0.665831212711668 As you can see, this method shows that the most optimal number of clusters is n = 100, but not too far from 50 to 100. The result of the shoulder method is shown below: This chart shows that the optimal amount is n = 30. Since the optimal number of clusters depends very much on the type of data, both methods do not always correctly indicate the right result, so the combined solution of the two methods was chosen as the right answer. As a result, the optimal number of clusters was chosen to be 50 and clustering was performed for it using the k- Identification of Unusual Wallets on Ethereum Platform 7 means method. Distribution of the cluster size can be seen in the figure number 3: It can be seen from the distribution diagram that one strongly dominant in terms of the size of the elements cluster, two clusters slightly above the average level, about 15 clusters of medium size and the other clusters of a very small size were obtained. These results will be used in the next chapter to build associative rules. 3.4 Association Rules Now based on all of the characteristics of the community we can try to find patterns that can be traced in this graph. For example, it may happen that if the vertex belongs to the 4-core, then its degree must necessarily be higher than 7. To do this, it is necessary to generate binary vectors in which the element at the i-th position will report whether this element belongs to a certain group or not. Having formed such vectors, we can find patterns, but on first step we need to form the groups themselves. For this, it is necessary to analyze the distribution of the characteristics of the vectors. To achieve this, we should construction distribution histograms for each of the characteristics. The first and third graphs of distributions are strongly shifted to the left, and the thresholds 8 M. Petrov between the groups are not visible, so we need to look at the left parts closer: From the received observations it is possible to break each of the character- istics into ranges of values into which approximately equal number of elements will fall. These ranges will also form groups for finding associative rules. The result is the following groups: Table 2. Groups Degree centrality Betweenness centrality Closenness centrality 0 < x ≤ 0.005 0 < x ≤ 0.0005 0 < x ≤ 0.2 0.005 < x ≤ 0.015 0.0005 < x ≤ 0.001 0.2 < x ≤ 0.3 0.015 < x 0.001 < x 0.3 < x ≤ 0.4 0.4 < x Generation of binary vectors is now carried out. To search for associative rules we used the mlxtend library, which is the apriori method to search for frequent sets of attributes, and then search for associative rules. In our script the rules with a minimum support = 0.8 are taken so that the existing patterns are highly probable. As a result of the launch, two associative rules were found: Identification of Unusual Wallets on Ethereum Platform 9 Table 3. New rules table Number Antecedants Consequents Antecedent support Consequent support 0 1 deg. cen. group 1 bet. cen. group 0.997946 0.994034 1 1 bet. cen. group 1 deg. cen. group 0.994034 0.997946 Support Confidence Lift Leverage Conviction 0.993876 0.995922 1.001899 0.001884 1.463008 0.993876 0.999841 1.001899 0.001884 12.922448 To get more associative rules, we need to decrease the support threshold, but then there will be less probabilistic associative rules, which will contradict the goal of finding typical signs of this community. 4 Top strangeness rating Now, when there is a whole set of characteristics of the vertices (clustering results, clustered vectors, associative rules), we can start isolating their typical values and finding all the values that will go beyond this framework. For each overshoot of these boundaries points are entered, which will award the vertices with these atypical characteristics. For each ”suspicious” characterization points from 1 to 100 will be given depending on the criticality. At the end, the promised rating of the suspicious vertices of the community graph will be obtained. Let’s start with the simplest - associative rules. Here everything is simple, because if a vector does not obey this rule, then it is atypical for this community. Here, the scores in the scores will be determined by the value of the support of the associative rule multiplied by one hundred. The next will be the scoring of vertices whose vectors are too far from the center of their clusters. We can calculate the average distance to the center of each cluster and the variance. Accordingly, if the distance of the vector does not fall within the range of the mean ± variance, then this vector is atypical. Points will be calculated according to the following formula: |mean − distance| , dif f erencemax mean is the average distance to the center in the cluster distance - the point to the center of a particular windbreaker dif f erencemax is the maximum difference between the average and the distance from the vector to the center inside the cluster Next will be scoring the vertices, whose characteristics of the vectors also go beyond the limits of the mean ± variance. Here the same normalized formula will be used, as well as with distances. The last will be awarded for belonging to atypical small clusters. Since the sizes of some clusters reach even one, this clearly indicates the atypicality of 10 M. Petrov the vertices in such clusters. Here we can not use the metric, as in the previous two cases, because the dimensions are too scattered, the average is quite heavily shifted to the left and the variance is several times larger than the average, which moves the left border to the minus. With a normal distribution, the segment of the mean ± variance covers approximately 68 percent of the values and for a given number of clusters equal to 50 these 68 percent are 34. Therefore, it was decided to score points to the vertices contained in the 16 smallest clusters. Since the number is 16 the vertice that hit the latest one will get 6.25 points; which falls in the second from the end - 12.5 and so on up to 100. As a result of summation of the values on the vertices we get the rating of the suspicious vertices of the community graph for the exchange of ether. 5 Conclusion 5.1 Result As a result of this work the open API Ethereum platform was used to download data on all transactions for the week. A community graph on ether exchange was built for the week. Further, the analysis of the obtained graph, the separation of the characteristics of vertices, clustering of vertices according to the selected characteristics and the derivation of associative rules were carried out. The re- sult of the work is the algorithm for constructing the rating of the suspicious vertices of the community graph. Since the algorithm generates a rating based on deviations from the standard values inside the similar wallets groups, the algorithm finally reveals the most suspicious wallets that differ from the general background. These wallets most likely have large transaction volumes on the balances or passing through them. After summing up all the points by the nodes the rating of the suspicious vertices for the exchange of the Ether was obtained. Below are the top of 5 values of this rating: Table 4. Top 5 suspicious vertices Value Node number 499.66345040230755 311861 488.3862740302283 633953 481.4078191332377 16010 457.7035133901014 314045 437.19194102533953 1627 As you can see, there are vertices with a fairly large number of points at the top of the rating, given that the maximum is 700 points. This indicates that the same vertices received points of 4 or more rules. Hence, it is stands out of Identification of Unusual Wallets on Ethereum Platform 11 the general background immediately by several characteristics, which proves its strangeness. References 1. Christian Cachin, AD Caro, Pedro Moreno-Sanchez, Björn Tackmann, and Marko Vukolic. The transaction graph for modeling blockchain semantics. Technical report, Cryptology ePrint Archive, Report 2017/1070, 2017. 2. Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM computing surveys (CSUR), 41(3):15, 2009. 3. Aric Hagberg, Pieter Swart, and Daniel S Chult. Exploring network structure, dynamics, and function using networkx. Technical report, Los Alamos National Lab.(LANL), Los Alamos, NM (United States), 2008. 4. Jan Lansky. Possible state approaches to cryptocurrencies. Journal of Systems Integration, 9(1):19–31, 2018. 5. Wenke Lee and Dong Xiang. Information-theoretic measures for anomaly detection. In Security and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Symposium on, pages 130–143. IEEE, 2001. 6. Mary McGlohon, Leman Akoglu, and Christos Faloutsos. Statistical properties of social networks. In Social network data analytics, pages 17–42. Springer, 2011. 7. Caleb C Noble and Diane J Cook. Graph-based anomaly detection. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 631–636. ACM, 2003. 8. Animesh Patcha and Jung-Min Park. An overview of anomaly detection techniques: Existing solutions and latest technological trends. Computer networks, 51(12):3448– 3470, 2007.