-

Exploiting Smart Contract Bytecode for Classi cation on Ethereum

Schloss Birlinghoven

Sankt Augustin

Germany

firstname.lastnameg@fit.fraunhofer.de

0 RWTH Aachen University , Ahornstra e 55, 52074 Aachen , Germany

11 22

Due to the increase in smart contracts in Ethereum, a need for proper classi cation has emerged. Although the smart contracts are accessible due to the open nature of the Blockchain, readability is still an issue with respect to the smart contract bytecode. We propose an automated approach for classifying smart contracts that utilize popular text classi cation methods on the opcode translation of the smart contract bytecode in order to overcome this limitation. Our experiments indicate that the decision-tree-based techniques like Random Forest and Xgboost outmatch the traditional classi cation tools like Nave Bayes, Logistic Regression, and SVM once the opcode input is presented as n-gram tfidf vectors.

Blockchain Text Classi cation Smart Contract

Being one of the most famous platforms since 2015, Ethereum has enabled many decentralized applications built on top of smart contracts and has been home to roughly more than 20 million smart contracts and hundreds of millions of transactions between the members[ 1 ]. This large amount of data has made it hard to get an overview of this large amount of smart contracts and their services. However, to be able to identify similar contracts, with other words to be able to classify them properly, can ease this problem. Proper classi cation of smart contracts might also enhance the comprehension of the smart contract application trends by nding examples for di erent categories and therefore contribute to the enhancement of the smart contract programming for which e ciency is essentially needed, and work as a cross-validation tool by providing an overall sense if the smart contract that a user interacts with actually accomplishes what it ensures.

Classi cation of smart contracts is an interesting challange. The only accessible data with respect to smart contracts is the compiled contract bytecode since the smart contracts stay anonymous if the source code is not made public by the developers[ 18 ]. The number of open-source smart contracts remains considerably low compared to the ones without published source code and using the source code for classi cation gives only limited results for the entire network. Therefore, we suggest the bytecode of the smart contracts to be exploited for the classi cation instead.

With this work, we examine the probable use of text classi cation methods on the opcode translation of smart contracts to determine if the analogy between the instruction sequences in a smart contract opcode and word sequences in texts can help to represent the bytecode in machine learning applications. In that regard, we explain the pipeline of our approach and the results obtained by the application of this pipeline for some prede ned classes. 2 2.1

Related Work Smart Contract Classi cation

Automated classi cation means have only been sparsely used for smart contract analysis and comparison. Bartoletti and Pompianu[ 3 ] conducted one of the rst studies in that regard by analyzing the published source codes of the 811 smart contracts from Ethereum and 23 from the Bitcoin network and creating ve categories that represent their data as Financial, Notary, Game, Wallet, and Library. Closely related, Klein[ 14 ], and Wohrer and Zdun[ 24 ] provide two studies on smart contract design patterns based on detailed research on literary work. While Klein identi es ten support and six application patterns, Wohrer and Zdun suggest 18 patterns that are classi ed into ve categories as Action and Control, Authorization, Lifecycle, Maintenance, and Security.

He et al.[ 11 ] present an analysis of the Ethereum network with a motivation to take a deeper look at the code reuse in smart contracts. They apply a customized fuzzy hashing logic to create smart contract ngerprints based on opcode and utilize the edit distance between ngerprints to calculate a similarity score that can be used for smart contract clustering. This process leads to four distinct clusters as ERC-20 Clusters, Game Contracts, ICO and AirDrop Contracts, and Other Contracts.

Tian et al.[ 23 ] o er a mixed approach based on Bi-LSTM model and Gaussian LDA to classify smart contracts accurately. They make use of the information extracted from smart contract source code like code comments in addition to the source code itself, the application labels fetched from www.stateofthedapps.com, and the account information. The proposed model is found to outperform alternative baseline models for the classi cation of smart contracts into one of the prede ned categories like Entertainment, Tools, Management, Finance, Lottery, Internet of Things, and Others.

There have also been di erent inspirations that lead to other classi cation approaches. Security vulnerabilities of smart contracts have motivated Tann et al.[ 22 ] into classifying contracts according to having a code vulnerability. They translate the opcode of the smart contracts into one-hot vector sequences which are then used in an LSTM model with some case-dependent modi cations. This approach has led to a tool with high detection accuracy that functions on the smart contract bytecode and identi es the security vulnerabilities of a smart contract. Another study led by Chen et al.[ 6 ] presented a comparison of di erent approaches in addition to di erent feature extraction techniques for detecting Ponzi schemes by using di erent characteristics of smart contracts coupled with the tf-idf measures of the smart contract opcode. They conclude that the Random Forest algorithm trained to solve a binary classi cation problem is found optimal for the detection of Ponzi schemes. 2.2

Text Analytics

One major application of machine learning has been text classi cation. The classi cation problem itself is de ned by Aggarwal and Zhai[ 2 ] as the utilization of a set of training entries D = fX1; :::; XN g with a label of a class picked from a set of k distinct entries with indexes f1; :::; kg to build a classi cation model that maps the descriptive features of each entry to a class label. In order to adapt this de nition into texts, texts and documents need to be transformed into a structured feature space to be able to pro t from the mathematical components of a classi er model since texts and documents are originally unstructured datasets[ 15 ]. Many popular approaches like the bag-of-words model[ 13 ], TF-IDF[ 19 ] measure, and word embeddings like Word2Vec[ 17 ] have been suggested for this purpose.

Deep learning and machine learning techniques have been the primary means for traditional text classi cation tasks[ 23 ]. Logistic regression[ 12, 16 ] and the Nave Bayes classi er[ 20 ], some of the earliest and straightforward classi cation tools, have been made use of to retrieve information. Some of the other methods that were investigated and utilized for texts include non-parametric algorithms like k-nearest neighbor[ 26 ], decision-tree-based classi ers such as Random Forest[ 4, 25 ] and Xgboost[ 5 ], and Support Vector Machine (SVM)[ 8 ][ 15 ]. Furthermore, the use of the neural network models like LSTM for modeling the complex relations with the data has also been motivated with the recent enhancements on deep learning techniques[ 23 ].

While some of the studies mentioned target classifying di erent smart contract applications, they do not make sense of the context through the smart contract bytecode. Furthermore, the studies that utilize the bytecode are built for a narrow use case and do not address di erent usage of smart contracts. In our study,we carry the use of bytecode to a broader perspective with the help of text analytics tools. 3

Methodology

The categories to be predicted in our study consist of the application patterns suggested by Klein[ 14 ] as voting, auction, entity management, renting, trading, and track &trace because of the extensive research comprehensive research conducted using the literature in addition to the descriptive and well-structured de nitions provided for the patterns. Since a smart contract can belong to multiple categories, we decide to apply binary relevance to our problem that enables using separate data representations and algorithms for di erent categories and allows for the extension of the number of categories in the future work. Finally, our experiments are limited to ve text classi cation algorithms, three of which appear traditional and easy to understand methods namely Nave Bayes, Logistic Regression, and Support Vector Machine(SVM) and the remaining are decision-tree-based algorithms like Random Forest and Xgboost due to the recent popularity they have. Because of the small number of examples in the dataset, deep learning algorithms (e.g., LSTM) are excluded from the study. 3.1

Data Collection & Preprocessing

The public crypto ethereum dataset in Google BigQuery3, which gets updated daily through a fully synced node, is utilized to collect bytecode of all the smart contracts residing in Ethereum and the corresponding addresses. Out of 16 426 716 smart contracts deployed to Ethereum by the data collection day, 6th January 2020, only 283 898 of them have been found to have a unique bytecode. While 84 909 contracts are found to have a published source code after querying etherscan.io, a famous block explorer, the source code of 198 989 contracts remained unknown. Finally, the opcode representation of each smart contract bytecode is obtained using an evm bytecode disassembler based on Python named evmdasm4 and the function names are extracted from the source code to ease the dataset creation process. 3.2

Dataset Creation

The class-speci c datasets which are later used for training and testing are created using the smart contracts with the available source code. Because of the timing constraints of the study and the high number of smart contracts for labeling, a ltering process based on domain keywords is applied to the extracted function names to get potential positive and negative examples for a class dataset. Due to their achievable size, we look at the source code of the potential positives and label them manually according to the class descriptions. However, the high number of potential negatives hinder us from getting a manual investigation over the source code for labeling. Thus, the lter keywords for potential negatives are extended via recursive analysis of the randomly sampled function names and the ltering is kept as strict as possible to refrain from any mislabeled examples in the dataset.

To illustrate the ltering keyword choices better, the Voting class can be examined. The contracts that include directly use case related words vote or 3 https://cloud.google.com/bigquery 4 https://github.com/ethereum/evmdasm ballot keywords in their function names are picked for potential positives. After randomly sampling function names, they are extended to include the smart contracts that have both or support and against at the same time. For the potential negatives, rst a set of negative ltering keywords are created with random function name sampling to nd voting use case related words and the contracts that do not have any of the keywords of vote, ballot, voting, proposal, support, and against are chosen as potential negatives.

In the end, the positive and negative examples are merged to create a dataset that represents the class well enough. A summary of this process can be seen in Fig.1. Since there is a di erent dataset obtained for each class, it is possible to nd examples that are positive for one class but negative for another or even positive for multiple classes, e.g. some gaming contracts which apply both entity management and auctioning. 3.3

Model Selection

Initially, the class dataset is split into a larger training&validation set that is used for model training and a smaller test set to determine the generalization ability of the nal model. The opcodes of the smart contracts inside the training&validation set are transformed into a feature space using three di erent vector representations including a count vector (e.g.[ 9 ]) which consists of the instruction frequencies inside an opcode, a tf-idf vector which indicates the tf-idf weights of instructions inside an opcode, and an n-gram tf-idf vector (e.g. [ 27 ]) that displays the tf-idf (term frequency-inverse document frequency) weights of n successive instructions inside an opcode. The default value for the n value in initial experiments are set as (9-10) as we nd it convenient through our preliminary experiments. Afterward, experiments on the training&validation set are conducted to obtain an average score using the k-fold cross-validation technique about one of the ve chosen algorithms (Nave Bayes, Logistic Regression, SVM, Random Forest, and Xgboost) coupled with one of these vector representations. Although there are several evaluation metrics that suit highly imbalanced datasets, the comparison and nal evaluation metric is chosen to be the Matthews Correlation Coe cient (MCC) as it is not independent from the true negatives[ 7 ] and in our case identifying both cases correctly is important. The de nition of MCC can be seen in Equation 1.

MCC =

T P

T N

F P

F N p(T P + F P ) (T P + F N ) (T N + F P ) (T N + F N ) (1) where TP, FP, TN, and FN indicate the number of true positive, false positive, true negative, and false negative predictions on the test set, respectively. It takes a value between 1 which indicates a complete disagreement between the real values and predictions and 1 which corresponds to a perfect classi er. A random classi er is expected to take a value around 0.

The highest scoring algorithm and data representation are then selected for additional improvements. A summary of the model selection phase can be seen in Fig. 2. 3.4

Model Improvement

Two potential spots for enhancement have been identi ed with regard to the selected algorithm and data representation. The rst one is applicable if n-gram tf-idf vector is found to be the highest-scoring data representation by trying out di erent n-gram ranges. This is realized by following the same methodology as in the model selection step by testing out di erent n-gram ranges depending on how much it changes the average score obtained via the k-fold cross-validation technique to increase or decrease the range. As higher n values cause drastically increased computation power, the range is kept as small as possible if the increase in the scores is not meaningful. A summary of this step can be seen in Fig. 3.

The second enhancement addresses the imbalance problem. It tests di erent resampling ratios for positive and negative examples within the training set and compares the results obtained on the validation set once the training&validation set is split into separate sets as a training set and validation set. For resampling, either the positive examples are increased by adding copies of randomly chosen positive examples, or the negative examples are decreased by the removal of some random ones to make the ratio of positive and negative examples better [ 10 ]. However, a drastic change is avoided to refrain from over tting or loss of information. At the end of this process, if the test scores obtained via applying the model on the test set is compatible with the best validation scores, the nal model is applied to all the smart contracts to get the nal predictions for belonging to the trained class. A summary of this step can be seen in Fig. 4. The dataset creation step creates class datasets with di erent distributions which are depicted in Table 1. The Track&Trace class is excluded from the study and left for future work as there were not any representative ltering domain words found. For the remaining classes, a high imbalance in favor of negative examples can be observed.

The three highest scoring algorithms and data representations are depicted in Table 2. While for some classes like Voting and Renting, the chosen method and data representation outperforms other options by a large margin, the score di erence between the top models is not so large for the other classes. On the other hand, the success of decision-tree-based algorithms compared to other tested algorithms becomes apparent. Furthermore, n-gram tf-idf vector is found to be the most successful data structure to represent the smart contract opcode.

Table 3 shows the algorithm, data representation, and corresponding average or nal MCC score by the application of the model after each step of the learning process for each class. Model Improvement I and Model Improvement II correspond to the testing of di erent n-gram ranges and resampling ratios, respectively. If there is no improvement observed for a step, no MCC score is given. However, it can already be observed that the e ect of the model improvement phase is comparably small when the overall scores are considered. Finally, since there were no prior studies based on the same dataset, we compare our results to the results obtained by the application of a random predictor that randomly assigns 0 or 1 for belonging to the class. The MCC score comparisons of the test results of the nal models and random model results show that the classi ers trained on the opcode representations of the smart contract bytecode outperform the random models for each class which is promising for further applications on the smart contract bytecode. Even though almost all the classi ers seem to do a decent job, further investigations are needed for the underperformance of the Renting classi er even though the fairly lower amount of positive examples in the dataset might be a factor. Further details about the methodology and results can be found in [ 21 ].

Threats to Validity

There have been two important issues found that might pose a threat to the validity of the test outcomes. While creating the class datasets, the ltering is kept very strict to remove the possible examples with false labels. For example, any smart contract with a buy function is discarded from the Trading class. This might cause the class dataset to be unrepresentative for all smart contracts with various application results. Hence, the results obtained on the test set, which is also derived from the class dataset, that shows how well the model generalizes might be inadequate. Another possibly problematic issue can be related to the disassembler used to get the opcode representations of the smart contracts. Some of the commands in smart contract bytecode were unknown and were discarded during the bytecode to opcode translation whose e ects are not explored in our study.

Conclusion & Future Work

So far, the classi cation of smart contracts with respect to application patterns has been based on code analysis. We employed text analytics to analyze smart contracts on the bytecode level. This analysis has allowed us to derive di erent smart contract use cases. This approach is easy to implement due to its few processing stages and applicable to a larger amount of data since it is not sourcecode dependent once it is trained. If there is enough data available for training, the nal test results show a good rate of classi cation and makes it promising to treat the smart contract opcode like a text for future applications.

For future work, we plan to extend our approach with a better labeling process of the smart contracts to be able to tackle the representativeness problem of the class datasets. Furthermore, we intend to investigate the e ect of the earlier elaboration of resampling in the learning process as well as the e ects of other potential resampling techniques. Considering the nal success of n-gram tf-idf vectors, another interesting investigation can be realized on performances of di erent algorithms with di erent n-gram ranges. Finally, improvements in the feature extraction is aimed with the help of additional smart contract related information.

1. Ethereum. https://ethereum.org/, accessed: 2020 -10-25.

2. Aggarwal , C.C. , Zhai , C. : A survey of text classi cation algorithms . In: Mining Text Data ( 2012 )

3. Bartoletti , M. , Pompianu , L.: An empirical analysis of smart contracts: platforms, applications, and design patterns . CoRR abs/1703 .06322 ( 2017 ), http://arxiv.org/abs/1703.06322

4. Bouaziz , A. , Dartigues-Pallez , C. , da Costa Pereira, C. , Precioso , F. , Lloret , P. : Short text classi cation using semantic random forest . In: Bellatreche, L. , Mohania , M.K. (eds.) Data Warehousing and Knowledge Discovery . pp. 288 { 299 . Springer International Publishing, Cham ( 2014 )

5. Chen , T. , Guestrin , C. : Xgboost: A scalable tree boosting system . In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining . p. 785 { 794 . KDD ' 16 , Association for Computing Machinery, New York, NY, USA ( 2016 ). https://doi.org/10.1145/2939672.2939785, https://doi.org/10.1145/2939672.2939785

6. Chen , W. , Zheng , Z. , Ngai , E. , Zheng , P. , Zhou , Y. : Exploiting blockchain data to detect smart ponzi schemes on ethereum . IEEE Access PP , 1 { 1 ( 03 2019 ). https://doi.org/10.1109/ACCESS. 2019 .2905769

7. Chicco , D. , Jurman , G. : The advantages of the matthews correlation coe cient (mcc) over f1 score and accuracy in binary classi cation evaluation . BMC Genomics 21 (1 2020 ). https://doi.org/10.1186/s12864-019-6413-7

8. Colas , F. , Brazdil , P. : Comparison of svm and some older classi cation algorithms in text classi cation tasks . In: Bramer, M. (ed.) Arti cial Intelligence in Theory and Practice . pp. 169 { 178 . Springer

, Boston, MA ( 2006 )

9. Erk , K. : Vector space models of word meaning and phrase meaning: A survey . Language and Linguistics Compass 6 ( 10 ), 635 { 653 ( 2012 ). https://doi.org/10.1002/lnco.362, https://onlinelibrary.wiley.com/doi/abs/10.1002/lnco.362

10. He , H. , Garcia , E.A. : Learning from imbalanced data . IEEE Transactions on Knowledge and Data Engineering 21 ( 9 ), 1263 { 1284 ( 2009 ). https://doi.org/10.1109/TKDE. 2008 .239

11. He , N. , Wu , L. , Wang , H. , Guo , Y. , Jiang , X. : Characterizing code clones in the ethereum smart contract ecosystem ( 2019 )

12. Ifrim , G. , Bakir , G. , Weikum , G. : Fast logistic regression for text categorization with variable-length n-grams. Association for Computing Machinery , New York, NY, USA ( 2008 ). https://doi.org/10.1145/1401890.1401936, https://doi.org/10.1145/1401890.1401936

13. K, S., Joseph , S. : Text classi cation by augmenting bag of words (bow) representation with co-occurrence feature . IOSR Journal of Computer Engineering 16 , 34{ 38 (01 2014 ). https://doi.org/10.9790/ 0661 - 16153438

14. Klein , S. : Smart Contract Design Patterns to assist Blockchain Conceptualization . Master's thesis , University of Cologne, Cologne ( 2019 )

15. Kowsari , K. , Meimandi , K.J. , Heidarysafa , M. , Mendu , S. , Barnes , L.E. , Brown , D.E.: Text classi cation algorithms: A survey . ArXiv abs/ 1904 .08067 ( 2019 )

16. Liu , B. : Learning with positive and unlabeled examples using weighted logistic regression . vol. 20 , pp. 448 { 455 (01 2003 )

17. Mikolov , T. , Chen , K. , Corrado , G. , Dean , J.:

E cient estimation of word representations in vector space (

2013 )

18. Norvill , R. ,

Fiz

Pontiveros , B. , State , R. , Awan , I. , Cullen , A. : Automated labeling of unknown contracts in ethereum . pp. 1 { 6 ( 07 2017 ). https://doi.org/10.1109/ICCCN. 2017 .8038513

19. Robertson , S. : Understanding inverse document frequency: On theoretical arguments for idf . Journal of Documentation - J DOC 60 , 503 { 520 (10 2004 ). https://doi.org/10.1108/00220410410560582

20. Sang-Bum

Kim

, Kyoung-Soo

Han

, Hae-Chang

Rim

, Sung Hyon Myaeng: Some effective techniques for naive bayes text classi cation . IEEE Transactions on Knowledge and Data Engineering 18 ( 11 ), 1457 { 1466 ( 2006 )

21. Sezer , S. : Automated Classi cation of Smart Contracts on Ethereum. Master's thesis , RWTH Aachen University, Aachen ( 2020 )

22. Tann , W.J., Han, X.J. , Gupta , S.S. , Ong , Y. : Towards safer smart contracts: A sequence learning approach to detecting vulnerabilities . CoRR abs/ 1811 .06632 ( 2018 ), http://arxiv.org/abs/ 1811 .06632

23. Tian , G. , Wang , Q. , Zhao , Y. , Guo , L. , Sun , Z. , Lv , L. : Smart contract classi cation with a bi-lstm based approach . IEEE Access 8 , 43806 { 43816 ( 2020 )

24. Wohrer, M. , Zdun , U. : Design patterns for smart contracts in the ethereum ecosystem . In: 2018 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData) . pp. 1513 { 1520 ( 2018 )

25. Xu , B. , Guo , X. , Ye , Y., Cheng, J.: An improved random forest classi er for text categorization . JCP 7 , 2913 { 2920 ( 2012 )

26. Yang , Y. , Liu , X.: A re-examination of text categorization methods . In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval . p. 42 { 49 . SIGIR ' 99 , Association for Computing Machinery, New York, NY, USA ( 1999 ). https://doi.org/10.1145/312624.312647, https://doi.org/10.1145/312624.312647

27. Zhang , H., Xiao , X. , Mercaldo , F. , Ni , S. , Martinelli , F. , Sangaiah , A.K. : Classi cation of ransomware families with machine learning based on n-gram of opcodes . Future Generation Computer Systems 90 , 211 { 221 ( 2019 ). https://doi.org/https://doi.org/10.1016/j.future. 2018 . 07 .052, http://www.sciencedirect.com/science/article/pii/S0167739X18307325