Exploiting Smart Contract Bytecode for Classification on Ethereum Selin Sezer 1[0000−0003−3635−0160] , Clemens Eyhoff1[0000−0002−5193−9494] , Wolfgang Prinz1,2[0000−0001−6846−5945] , and Thomas Rose1,2[0000−0001−8728−826X] 1 Fraunhofer Institute for Applied Information Technology FIT, Schloss Birlinghoven, 53754 Sankt Augustin, Germany {firstname.lastname}@fit.fraunhofer.de 2 RWTH Aachen University, Ahornstraße 55, 52074 Aachen, Germany Abstract. Due to the increase in smart contracts in Ethereum, a need for proper classification has emerged. Although the smart contracts are accessible due to the open nature of the Blockchain, readability is still an issue with respect to the smart contract bytecode. We propose an auto- mated approach for classifying smart contracts that utilize popular text classification methods on the opcode translation of the smart contract bytecode in order to overcome this limitation. Our experiments indicate that the decision-tree-based techniques like Random Forest and Xgboost outmatch the traditional classification tools like Naı̈ve Bayes, Logistic Regression, and SVM once the opcode input is presented as n-gram tf- idf vectors. Keywords: Blockchain · Text Classification · Smart Contract. 1 Introduction Being one of the most famous platforms since 2015, Ethereum has enabled many decentralized applications built on top of smart contracts and has been home to roughly more than 20 million smart contracts and hundreds of millions of transactions between the members[1]. This large amount of data has made it hard to get an overview of this large amount of smart contracts and their services. However, to be able to identify similar contracts, with other words to be able to classify them properly, can ease this problem. Proper classification of smart contracts might also enhance the comprehension of the smart contract applica- tion trends by finding examples for different categories and therefore contribute to the enhancement of the smart contract programming for which efficiency is essentially needed, and work as a cross-validation tool by providing an overall sense if the smart contract that a user interacts with actually accomplishes what it ensures. Copyright c 2020 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). 11 Classification of smart contracts is an interesting challange. The only accessi- ble data with respect to smart contracts is the compiled contract bytecode since the smart contracts stay anonymous if the source code is not made public by the developers[18]. The number of open-source smart contracts remains consid- erably low compared to the ones without published source code and using the source code for classification gives only limited results for the entire network. Therefore, we suggest the bytecode of the smart contracts to be exploited for the classification instead. With this work, we examine the probable use of text classification methods on the opcode translation of smart contracts to determine if the analogy between the instruction sequences in a smart contract opcode and word sequences in texts can help to represent the bytecode in machine learning applications. In that regard, we explain the pipeline of our approach and the results obtained by the application of this pipeline for some predefined classes. 2 Related Work 2.1 Smart Contract Classification Automated classification means have only been sparsely used for smart contract analysis and comparison. Bartoletti and Pompianu[3] conducted one of the first studies in that regard by analyzing the published source codes of the 811 smart contracts from Ethereum and 23 from the Bitcoin network and creating five categories that represent their data as Financial, Notary, Game, Wallet, and Library. Closely related, Klein[14], and Wöhrer and Zdun[24] provide two studies on smart contract design patterns based on detailed research on literary work. While Klein identifies ten support and six application patterns, Wöhrer and Zdun suggest 18 patterns that are classified into five categories as Action and Control, Authorization, Lifecycle, Maintenance, and Security. He et al.[11] present an analysis of the Ethereum network with a motivation to take a deeper look at the code reuse in smart contracts. They apply a customized fuzzy hashing logic to create smart contract fingerprints based on opcode and utilize the edit distance between fingerprints to calculate a similarity score that can be used for smart contract clustering. This process leads to four distinct clusters as ERC-20 Clusters, Game Contracts, ICO and AirDrop Contracts, and Other Contracts. Tian et al.[23] offer a mixed approach based on Bi-LSTM model and Gaussian LDA to classify smart contracts accurately. They make use of the information extracted from smart contract source code like code comments in addition to the source code itself, the application labels fetched from www.stateofthedapps.com, and the account information. The proposed model is found to outperform alter- native baseline models for the classification of smart contracts into one of the predefined categories like Entertainment, Tools, Management, Finance, Lottery, Internet of Things, and Others. There have also been different inspirations that lead to other classification approaches. Security vulnerabilities of smart contracts have motivated Tann et 12 al.[22] into classifying contracts according to having a code vulnerability. They translate the opcode of the smart contracts into one-hot vector sequences which are then used in an LSTM model with some case-dependent modifications. This approach has led to a tool with high detection accuracy that functions on the smart contract bytecode and identifies the security vulnerabilities of a smart con- tract. Another study led by Chen et al.[6] presented a comparison of different approaches in addition to different feature extraction techniques for detecting Ponzi schemes by using different characteristics of smart contracts coupled with the tf-idf measures of the smart contract opcode. They conclude that the Ran- dom Forest algorithm trained to solve a binary classification problem is found optimal for the detection of Ponzi schemes. 2.2 Text Analytics One major application of machine learning has been text classification. The clas- sification problem itself is defined by Aggarwal and Zhai[2] as the utilization of a set of training entries D = {X1 , ..., XN } with a label of a class picked from a set of k distinct entries with indexes {1, ..., k} to build a classification model that maps the descriptive features of each entry to a class label. In order to adapt this defi- nition into texts, texts and documents need to be transformed into a structured feature space to be able to profit from the mathematical components of a clas- sifier model since texts and documents are originally unstructured datasets[15]. Many popular approaches like the bag-of-words model[13], TF-IDF[19] measure, and word embeddings like Word2Vec[17] have been suggested for this purpose. Deep learning and machine learning techniques have been the primary means for traditional text classification tasks[23]. Logistic regression[12, 16] and the Naı̈ve Bayes classifier[20], some of the earliest and straightforward classification tools, have been made use of to retrieve information. Some of the other methods that were investigated and utilized for texts include non-parametric algorithms like k-nearest neighbor[26], decision-tree-based classifiers such as Random For- est[4, 25] and Xgboost[5], and Support Vector Machine (SVM)[8][15]. Further- more, the use of the neural network models like LSTM for modeling the complex relations with the data has also been motivated with the recent enhancements on deep learning techniques[23]. While some of the studies mentioned target classifying different smart con- tract applications, they do not make sense of the context through the smart contract bytecode. Furthermore, the studies that utilize the bytecode are built for a narrow use case and do not address different usage of smart contracts. In our study,we carry the use of bytecode to a broader perspective with the help of text analytics tools. 3 Methodology The categories to be predicted in our study consist of the application patterns suggested by Klein[14] as voting, auction, entity management, renting, trading, 13 and track &trace because of the extensive research comprehensive research con- ducted using the literature in addition to the descriptive and well-structured definitions provided for the patterns. Since a smart contract can belong to mul- tiple categories, we decide to apply binary relevance to our problem that enables using separate data representations and algorithms for different categories and allows for the extension of the number of categories in the future work. Finally, our experiments are limited to five text classification algorithms, three of which appear traditional and easy to understand methods namely Naı̈ve Bayes, Lo- gistic Regression, and Support Vector Machine(SVM) and the remaining are decision-tree-based algorithms like Random Forest and Xgboost due to the re- cent popularity they have. Because of the small number of examples in the dataset, deep learning algorithms (e.g., LSTM) are excluded from the study. 3.1 Data Collection & Preprocessing The public crypto ethereum dataset in Google BigQuery3 , which gets updated daily through a fully synced node, is utilized to collect bytecode of all the smart contracts residing in Ethereum and the corresponding addresses. Out of 16 426 716 smart contracts deployed to Ethereum by the data collection day, 6th Jan- uary 2020, only 283 898 of them have been found to have a unique bytecode. While 84 909 contracts are found to have a published source code after querying etherscan.io, a famous block explorer, the source code of 198 989 contracts re- mained unknown. Finally, the opcode representation of each smart contract byte- code is obtained using an evm bytecode disassembler based on Python named evmdasm4 and the function names are extracted from the source code to ease the dataset creation process. 3.2 Dataset Creation The class-specific datasets which are later used for training and testing are cre- ated using the smart contracts with the available source code. Because of the timing constraints of the study and the high number of smart contracts for labeling, a filtering process based on domain keywords is applied to the ex- tracted function names to get potential positive and negative examples for a class dataset. Due to their achievable size, we look at the source code of the potential positives and label them manually according to the class descriptions. However, the high number of potential negatives hinder us from getting a man- ual investigation over the source code for labeling. Thus, the filter keywords for potential negatives are extended via recursive analysis of the randomly sampled function names and the filtering is kept as strict as possible to refrain from any mislabeled examples in the dataset. To illustrate the filtering keyword choices better, the Voting class can be examined. The contracts that include directly use case related words vote or 3 https://cloud.google.com/bigquery 4 https://github.com/ethereum/evmdasm 14 Fig. 1. Dataset creation sub-steps ballot keywords in their function names are picked for potential positives. After randomly sampling function names, they are extended to include the smart con- tracts that have both or support and against at the same time. For the potential negatives, first a set of negative filtering keywords are created with random func- tion name sampling to find voting use case related words and the contracts that do not have any of the keywords of vote, ballot, voting, proposal, support, and against are chosen as potential negatives. In the end, the positive and negative examples are merged to create a dataset that represents the class well enough. A summary of this process can be seen in Fig.1. Since there is a different dataset obtained for each class, it is possible to find examples that are positive for one class but negative for another or even positive for multiple classes, e.g. some gaming contracts which apply both entity management and auctioning. 3.3 Model Selection Initially, the class dataset is split into a larger training&validation set that is used for model training and a smaller test set to determine the generalization ability of the final model. The opcodes of the smart contracts inside the train- ing&validation set are transformed into a feature space using three different vector representations including a count vector (e.g.[9]) which consists of the in- struction frequencies inside an opcode, a tf-idf vector which indicates the tf-idf weights of instructions inside an opcode, and an n-gram tf-idf vector (e.g. [27]) that displays the tf-idf (term frequency-inverse document frequency) weights of n successive instructions inside an opcode. The default value for the n value in initial experiments are set as (9-10) as we find it convenient through our preliminary experiments. Afterward, experiments on the training&validation set 15 Fig. 2. Model selection sub-steps are conducted to obtain an average score using the k-fold cross-validation tech- nique about one of the five chosen algorithms (Naı̈ve Bayes, Logistic Regression, SVM, Random Forest, and Xgboost) coupled with one of these vector represen- tations. Although there are several evaluation metrics that suit highly imbal- anced datasets, the comparison and final evaluation metric is chosen to be the Matthews Correlation Coefficient (MCC) as it is not independent from the true negatives[7] and in our case identifying both cases correctly is important. The definition of MCC can be seen in Equation 1. TP × TN − FP × FN MCC = p (1) (T P + F P ) (T P + F N ) (T N + F P ) (T N + F N ) where TP, FP, TN, and FN indicate the number of true positive, false positive, true negative, and false negative predictions on the test set, respectively. It takes a value between −1 which indicates a complete disagreement between the real values and predictions and 1 which corresponds to a perfect classifier. A random classifier is expected to take a value around 0. The highest scoring algorithm and data representation are then selected for additional improvements. A summary of the model selection phase can be seen in Fig. 2. 16 3.4 Model Improvement Two potential spots for enhancement have been identified with regard to the selected algorithm and data representation. The first one is applicable if n-gram tf-idf vector is found to be the highest-scoring data representation by trying out different n-gram ranges. This is realized by following the same methodology as in the model selection step by testing out different n-gram ranges depending on how much it changes the average score obtained via the k-fold cross-validation technique to increase or decrease the range. As higher n values cause drastically increased computation power, the range is kept as small as possible if the increase in the scores is not meaningful. A summary of this step can be seen in Fig. 3. Fig. 3. Model improvement via n-gram modifications The second enhancement addresses the imbalance problem. It tests different resampling ratios for positive and negative examples within the training set and compares the results obtained on the validation set once the training&validation set is split into separate sets as a training set and validation set. For resampling, either the positive examples are increased by adding copies of randomly chosen positive examples, or the negative examples are decreased by the removal of some random ones to make the ratio of positive and negative examples better [10]. However, a drastic change is avoided to refrain from overfitting or loss of information. At the end of this process, if the test scores obtained via applying the model on the test set is compatible with the best validation scores, the final model is applied to all the smart contracts to get the final predictions for belonging to the trained class. A summary of this step can be seen in Fig. 4. 17 Fig. 4. Model improvement via resampling 4 Results The dataset creation step creates class datasets with different distributions which are depicted in Table 1. The Track&Trace class is excluded from the study and left for future work as there were not any representative filtering domain words found. For the remaining classes, a high imbalance in favor of negative examples can be observed. Table 1. Distributions of the Class Datasets Class Number of Positive Examples Number of Negative Examples Voting 779 (1%) 73162 (99%) Auction 290 (4%) 6655 (96%) Entity Management 495 (3%) 19274 (97%) Renting 33 (1%) 6052 (99%) Trading 738 (12%) 5658 (88%) The three highest scoring algorithms and data representations are depicted in Table 2. While for some classes like Voting and Renting, the chosen method and data representation outperforms other options by a large margin, the score difference between the top models is not so large for the other classes. On the other hand, the success of decision-tree-based algorithms compared to other tested algorithms becomes apparent. Furthermore, n-gram tf-idf vector is found to be the most successful data structure to represent the smart contract opcode. 18 Table 2. Top Three Algorithms and Data Representations for Each Class Class Algorithm Data Representation MCC Score Random Forest n-gram tf-idf 0.745190 Voting Random Forest count vector 0.626318 Random Forest tf-idf 0.584108 Xgboost n-gram tf-idf 0.828594 Auction Random Forest n-gram tf-idf 0.781024 Xgboost tf-idf 0.766158 Xgboost n-gram tf-idf 0.792610 Entity Management Random Forest count vector 0.763228 Xgboost count vector 0.749616 Xgboost n-gram tf-idf 0.533938 Renting Xgboost count vector 0.361356 Random Forest n-gram tf-idf 0.299207 Xgboost n-gram tf-idf 0.787698 Trading Random Forest n-gram tf-idf 0.751675 Xgboost count vector 0.750335 Table 3 shows the algorithm, data representation, and corresponding aver- age or final MCC score by the application of the model after each step of the learning process for each class. Model Improvement I and Model Improvement II correspond to the testing of different n-gram ranges and resampling ratios, re- spectively. If there is no improvement observed for a step, no MCC score is given. However, it can already be observed that the effect of the model improvement phase is comparably small when the overall scores are considered. Finally, since there were no prior studies based on the same dataset, we compare our results to the results obtained by the application of a random predictor that randomly assigns 0 or 1 for belonging to the class. The MCC score comparisons of the test results of the final models and random model results show that the classifiers trained on the opcode representations of the smart contract bytecode outperform the random models for each class which is promising for further applications on the smart contract bytecode. Even though almost all the classifiers seem to do a decent job, further investigations are needed for the underperformance of the Renting classifier even though the fairly lower amount of positive examples in the dataset might be a factor. Further details about the methodology and results can be found in [21]. Threats to Validity There have been two important issues found that might pose a threat to the validity of the test outcomes. While creating the class datasets, the filtering is kept very strict to remove the possible examples with false labels. For example, any smart contract with a buy function is discarded from the Trading class. This 19 might cause the class dataset to be unrepresentative for all smart contracts with various application results. Hence, the results obtained on the test set, which is also derived from the class dataset, that shows how well the model generalizes might be inadequate. Another possibly problematic issue can be related to the disassembler used to get the opcode representations of the smart contracts. Some of the commands in smart contract bytecode were unknown and were discarded during the bytecode to opcode translation whose effects are not explored in our study. Table 3. Learning Process Results of Each Class Class Step Algorithm Data Representation MCC Model Selection Random Forest (9, 10)-gram tf-idf 0.745190 Model Improvement I Random Forest (9, 10)-gram tf-idf - Voting Model Improvement II Random Forest (9, 10)-gram tf-idf 0.749153 Test Results Random Forest (9, 10)-gram tf-idf 0.780024 Random Model Results - - -0.012987 Model Selection Xgboost (9, 10)-gram tf-idf 0.828594 Model Improvement I Xgboost (7, 12)-gram tf-idf 0.855746 Auction Model Improvement II Xgboost (7, 12)-gram tf-idf - Test Results Xgboost (7, 12)-gram tf-idf 0.746421 Random Model Results - - 0.038154 Model Selection Xgboost (9, 10)-gram tf-idf 0.792610 Model Improvement I Xgboost (7, 15)-gram tf-idf 0.809968 Entity Model Improvement II Xgboost (7, 15)-gram tf-idf 0.876016 Management Test Results Xgboost (7, 15)-gram tf-idf 0.768002 Random Model Results - - 0.003240 Model Selection Xgboost (9, 10)-gram tf-idf 0.533938 Model Improvement I Xgboost (7, 15)-gram tf-idf 0.567797 Renting Model Improvement II Xgboost (7, 15)-gram tf-idf 0.567797 Test Results Xgboost (7, 15)-gram tf-idf 0.630497 Random Model Results - - -0.046200 Model Selection Xgboost (9, 10)-gram tf-idf 0.787698 Model Improvement I Xgboost (7, 15)-gram tf-idf 0.806156 Trading Model Improvement II Xgboost (7, 15)-gram tf-idf - Test Results Xgboost (7, 15)-gram tf-idf 0.782209 Random Model Results - - -0.022187 5 Conclusion & Future Work So far, the classification of smart contracts with respect to application patterns has been based on code analysis. We employed text analytics to analyze smart contracts on the bytecode level. This analysis has allowed us to derive different smart contract use cases. This approach is easy to implement due to its few 20 processing stages and applicable to a larger amount of data since it is not source- code dependent once it is trained. If there is enough data available for training, the final test results show a good rate of classification and makes it promising to treat the smart contract opcode like a text for future applications. For future work, we plan to extend our approach with a better labeling process of the smart contracts to be able to tackle the representativeness problem of the class datasets. Furthermore, we intend to investigate the effect of the earlier elaboration of resampling in the learning process as well as the effects of other potential resampling techniques. Considering the final success of n-gram tf-idf vectors, another interesting investigation can be realized on performances of different algorithms with different n-gram ranges. Finally, improvements in the feature extraction is aimed with the help of additional smart contract related information. References 1. Ethereum. https://ethereum.org/, accessed: 2020-10-25. 2. Aggarwal, C.C., Zhai, C.: A survey of text classification algorithms. In: Mining Text Data (2012) 3. Bartoletti, M., Pompianu, L.: An empirical analysis of smart contracts: plat- forms, applications, and design patterns. CoRR abs/1703.06322 (2017), http://arxiv.org/abs/1703.06322 4. Bouaziz, A., Dartigues-Pallez, C., da Costa Pereira, C., Precioso, F., Lloret, P.: Short text classification using semantic random forest. In: Bellatreche, L., Mohania, M.K. (eds.) Data Warehousing and Knowledge Discovery. pp. 288–299. Springer International Publishing, Cham (2014) 5. Chen, T., Guestrin, C.: Xgboost: A scalable tree boosting system. In: Proceed- ings of the 22nd ACM SIGKDD International Conference on Knowledge Dis- covery and Data Mining. p. 785–794. KDD ’16, Association for Computing Ma- chinery, New York, NY, USA (2016). https://doi.org/10.1145/2939672.2939785, https://doi.org/10.1145/2939672.2939785 6. Chen, W., Zheng, Z., Ngai, E., Zheng, P., Zhou, Y.: Exploiting blockchain data to detect smart ponzi schemes on ethereum. IEEE Access PP, 1–1 (03 2019). https://doi.org/10.1109/ACCESS.2019.2905769 7. Chicco, D., Jurman, G.: The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation. BMC Genomics 21 (1 2020). https://doi.org/10.1186/s12864-019-6413-7 8. Colas, F., Brazdil, P.: Comparison of svm and some older classification algorithms in text classification tasks. In: Bramer, M. (ed.) Artificial Intelligence in Theory and Practice. pp. 169–178. Springer US, Boston, MA (2006) 9. Erk, K.: Vector space models of word meaning and phrase meaning: A survey. Language and Linguistics Com- pass 6(10), 635–653 (2012). https://doi.org/10.1002/lnco.362, https://onlinelibrary.wiley.com/doi/abs/10.1002/lnco.362 10. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans- actions on Knowledge and Data Engineering 21(9), 1263–1284 (2009). https://doi.org/10.1109/TKDE.2008.239 21 11. He, N., Wu, L., Wang, H., Guo, Y., Jiang, X.: Characterizing code clones in the ethereum smart contract ecosystem (2019) 12. Ifrim, G., Bakir, G., Weikum, G.: Fast logistic regression for text cate- gorization with variable-length n-grams. Association for Computing Machin- ery, New York, NY, USA (2008). https://doi.org/10.1145/1401890.1401936, https://doi.org/10.1145/1401890.1401936 13. K, S., Joseph, S.: Text classification by augmenting bag of words (bow) representa- tion with co-occurrence feature. IOSR Journal of Computer Engineering 16, 34–38 (01 2014). https://doi.org/10.9790/0661-16153438 14. Klein, S.: Smart Contract Design Patterns to assist Blockchain Conceptualization. Master’s thesis, University of Cologne, Cologne (2019) 15. Kowsari, K., Meimandi, K.J., Heidarysafa, M., Mendu, S., Barnes, L.E., Brown, D.E.: Text classification algorithms: A survey. ArXiv abs/1904.08067 (2019) 16. Liu, B.: Learning with positive and unlabeled examples using weighted logistic regression. vol. 20, pp. 448–455 (01 2003) 17. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre- sentations in vector space (2013) 18. Norvill, R., Fiz Pontiveros, B., State, R., Awan, I., Cullen, A.: Auto- mated labeling of unknown contracts in ethereum. pp. 1–6 (07 2017). https://doi.org/10.1109/ICCCN.2017.8038513 19. Robertson, S.: Understanding inverse document frequency: On theoretical ar- guments for idf. Journal of Documentation - J DOC 60, 503–520 (10 2004). https://doi.org/10.1108/00220410410560582 20. Sang-Bum Kim, Kyoung-Soo Han, Hae-Chang Rim, Sung Hyon Myaeng: Some ef- fective techniques for naive bayes text classification. IEEE Transactions on Knowl- edge and Data Engineering 18(11), 1457–1466 (2006) 21. Sezer, S.: Automated Classification of Smart Contracts on Ethereum. Master’s thesis, RWTH Aachen University, Aachen (2020) 22. Tann, W.J., Han, X.J., Gupta, S.S., Ong, Y.: Towards safer smart contracts: A sequence learning approach to detecting vulnerabilities. CoRR abs/1811.06632 (2018), http://arxiv.org/abs/1811.06632 23. Tian, G., Wang, Q., Zhao, Y., Guo, L., Sun, Z., Lv, L.: Smart contract classification with a bi-lstm based approach. IEEE Access 8, 43806–43816 (2020) 24. Wöhrer, M., Zdun, U.: Design patterns for smart contracts in the ethereum ecosys- tem. In: 2018 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData). pp. 1513–1520 (2018) 25. Xu, B., Guo, X., Ye, Y., Cheng, J.: An improved random forest classifier for text categorization. JCP 7, 2913–2920 (2012) 26. Yang, Y., Liu, X.: A re-examination of text categorization methods. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Devel- opment in Information Retrieval. p. 42–49. SIGIR ’99, Association for Computing Machinery, New York, NY, USA (1999). https://doi.org/10.1145/312624.312647, https://doi.org/10.1145/312624.312647 27. Zhang, H., Xiao, X., Mercaldo, F., Ni, S., Martinelli, F., Sangaiah, A.K.: Classification of ransomware families with machine learning based on n-gram of opcodes. Future Generation Computer Systems 90, 211 – 221 (2019). https://doi.org/https://doi.org/10.1016/j.future.2018.07.052, http://www.sciencedirect.com/science/article/pii/S0167739X18307325 22