<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploiting Smart Contract Bytecode for Classification on Ethereum</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff1">
          <label>1</label>
          <addr-line>Schloss Birlinghoven, Sankt Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
          ,
          <email>{firstname.lastname}@fit.fraunhofer.de</email>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>RWTH Aachen University</institution>
          ,
          <addr-line>Ahornstraße 55, 52074 Aachen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>11</fpage>
      <lpage>22</lpage>
      <abstract>
        <p>Due to the growing number of smart contracts on Ethereum, a need for proper classification has emerged. Although smart contracts are accessible due to the open nature of the blockchain, the readability of smart contract bytecode remains an issue. To overcome this limitation, we propose an automated approach for classifying smart contracts that applies popular text classification methods to the opcode translation of the smart contract bytecode. Our experiments indicate that decision-tree-based techniques such as Random Forest and Xgboost outperform traditional classification tools such as Naïve Bayes, Logistic Regression, and SVM once the opcode input is represented as n-gram tf-idf vectors.</p>
      </abstract>
      <kwd-group>
        <kwd>Blockchain</kwd>
        <kwd>Text Classification</kwd>
        <kwd>Smart Contract</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        As one of the best-known blockchain platforms since 2015, Ethereum has enabled
many decentralized applications built on top of smart contracts and has been
home to more than 20 million smart contracts and hundreds of millions
of transactions between its members [<xref ref-type="bibr" rid="ref1">1</xref>]. This large amount of data has made it
hard to get an overview of the smart contracts and their services.
Being able to identify similar contracts, in other words to classify them
properly, can ease this problem. Proper classification of smart
contracts might also improve the comprehension of smart contract
application trends by finding examples for different categories, and thereby contribute
to better smart contract programming, for which efficiency is
essential. It can also work as a cross-validation tool by indicating
whether the smart contract that a user interacts with actually accomplishes what
it promises.
      </p>
      <p>
        Classification of smart contracts is an interesting challenge. The only
data that is always accessible for a smart contract is the compiled contract bytecode, since
smart contracts stay anonymous if the source code is not made public by
the developers [<xref ref-type="bibr" rid="ref18">18</xref>]. The number of open-source smart contracts remains
considerably low compared to those without published source code, so using the
source code for classification gives only limited results for the entire network.
Therefore, we suggest exploiting the bytecode of the smart contracts for
classification instead.
      </p>
      <p>With this work, we examine the use of text classification methods on
the opcode translation of smart contracts to determine whether the analogy between
the instruction sequences in smart contract opcode and word sequences in
texts can help to represent the bytecode in machine learning applications. To
that end, we explain the pipeline of our approach and the results obtained by
applying this pipeline to a set of predefined classes.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <sec id="sec-2-1">
        <title>Smart Contract Classification</title>
        <p>
          Automated classification has only sparsely been used for smart contract
analysis and comparison. Bartoletti and Pompianu [<xref ref-type="bibr" rid="ref3">3</xref>] conducted one of the first
studies in that regard by analyzing the published source code of 811 smart
contracts from Ethereum and 23 from the Bitcoin network and creating five
categories that represent their data: Financial, Notary, Game, Wallet, and
Library. Closely related, Klein [<xref ref-type="bibr" rid="ref14">14</xref>] and Wöhrer and Zdun [<xref ref-type="bibr" rid="ref24">24</xref>] provide two studies
on smart contract design patterns based on detailed literature research.
While Klein identifies ten support and six application patterns, Wöhrer and
Zdun suggest 18 patterns that are classified into five categories: Action and
Control, Authorization, Lifecycle, Maintenance, and Security.
        </p>
        <p>
          He et al. [<xref ref-type="bibr" rid="ref11">11</xref>] present an analysis of the Ethereum network motivated by
a deeper look at code reuse in smart contracts. They apply a customized
fuzzy hashing scheme to create smart contract fingerprints based on opcode and
use the edit distance between fingerprints to calculate a similarity score that
can be used for smart contract clustering. This process leads to four distinct
clusters: ERC-20 Clusters, Game Contracts, ICO and AirDrop Contracts, and
Other Contracts.
        </p>
        <p>
          Tian et al. [<xref ref-type="bibr" rid="ref23">23</xref>] offer a mixed approach based on a Bi-LSTM model and Gaussian
LDA to classify smart contracts accurately. They make use of information
extracted from the smart contract source code, such as code comments, in addition to the
source code itself, the application labels fetched from www.stateofthedapps.com,
and the account information. The proposed model is found to outperform
alternative baseline models for the classification of smart contracts into
predefined categories: Entertainment, Tools, Management, Finance, Lottery,
Internet of Things, and Others.
        </p>
        <p>
          Other motivations have led to different classification
approaches. Security vulnerabilities of smart contracts motivated Tann et
al. [<xref ref-type="bibr" rid="ref22">22</xref>] to classify contracts according to whether they contain a code vulnerability. They
translate the opcode of the smart contracts into one-hot vector sequences, which
are then used in an LSTM model with some case-dependent modifications. This
approach has led to a tool with high detection accuracy that operates on the
smart contract bytecode and identifies the security vulnerabilities of a smart
contract. Another study by Chen et al. [<xref ref-type="bibr" rid="ref6">6</xref>] compares different
approaches and feature extraction techniques for detecting
Ponzi schemes, using different characteristics of smart contracts coupled with
the tf-idf measures of the smart contract opcode. They conclude that a
Random Forest algorithm trained to solve a binary classification problem is
optimal for the detection of Ponzi schemes.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Text Analytics</title>
        <p>
          One major application of machine learning is text classification. The
classification problem itself is defined by Aggarwal and Zhai [<xref ref-type="bibr" rid="ref2">2</xref>] as the use of a
set of training entries D = {X<sub>1</sub>, ..., X<sub>N</sub>}, each labeled with a class drawn from a set of
k distinct values indexed by {1, ..., k}, to build a classification model that maps
the descriptive features of each entry to a class label. To adapt this
definition to texts, texts and documents need to be transformed into a structured
feature space so that they can benefit from the mathematical components of a
classifier model, since texts and documents are originally unstructured datasets [<xref ref-type="bibr" rid="ref15">15</xref>].
Many popular approaches, such as the bag-of-words model [<xref ref-type="bibr" rid="ref13">13</xref>], the TF-IDF measure [<xref ref-type="bibr" rid="ref19">19</xref>],
and word embeddings like Word2Vec [<xref ref-type="bibr" rid="ref17">17</xref>], have been suggested for this purpose.
        </p>
        <p>
          Deep learning and machine learning techniques have been the primary means
for traditional text classification tasks [<xref ref-type="bibr" rid="ref23">23</xref>]. Logistic regression [<xref ref-type="bibr" rid="ref12 ref16">12, 16</xref>] and the
Naïve Bayes classifier [<xref ref-type="bibr" rid="ref20">20</xref>], some of the earliest and most straightforward classification
tools, have been used for information retrieval. Other methods
that have been investigated and used for texts include non-parametric algorithms
like k-nearest neighbor [<xref ref-type="bibr" rid="ref26">26</xref>], decision-tree-based classifiers such as Random
Forest [<xref ref-type="bibr" rid="ref25 ref4">4, 25</xref>] and Xgboost [<xref ref-type="bibr" rid="ref5">5</xref>], and the Support Vector Machine (SVM) [<xref ref-type="bibr" rid="ref8">8</xref>][<xref ref-type="bibr" rid="ref15">15</xref>].
Furthermore, the use of neural network models like LSTM for modeling complex
relations within the data has been motivated by recent advances
in deep learning techniques [<xref ref-type="bibr" rid="ref23">23</xref>].
        </p>
        <p>While some of the studies mentioned target the classification of different smart
contract applications, they do not derive the context from the smart
contract bytecode. Furthermore, the studies that do utilize the bytecode are built
for a narrow use case and do not address the different usages of smart contracts. In
our study, we carry the use of bytecode to a broader perspective with the help of
text analytics tools.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>
        The categories to be predicted in our study are the application patterns
suggested by Klein [<xref ref-type="bibr" rid="ref14">14</xref>]: voting, auction, entity management, renting, trading,
and track &amp; trace. We chose them because of the comprehensive literature research
behind them and the descriptive, well-structured
definitions provided for the patterns. Since a smart contract can belong to
multiple categories, we apply binary relevance to our problem, which enables
using separate data representations and algorithms for different categories and
allows the number of categories to be extended in future work. Finally,
our experiments are limited to five text classification algorithms: three
traditional and easy-to-understand methods, namely Naïve Bayes,
Logistic Regression, and Support Vector Machine (SVM), and two
decision-tree-based algorithms, Random Forest and Xgboost, chosen for their
recent popularity. Because of the small number of examples in the
dataset, deep learning algorithms (e.g., LSTM) are excluded from the study.
      </p>
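      <p>As a minimal sketch of the binary relevance setup described above, the following Python snippet turns multi-label annotations into one independent binary label vector per class. The contract addresses and label sets are hypothetical placeholders, not data from the study, and the function name binary_relevance_labels is an assumption made for illustration.</p>
      <preformat>
```python
# Binary relevance: one independent yes/no problem per category.
# The addresses and label sets below are made-up placeholders.
CATEGORIES = ["voting", "auction", "entity management", "renting", "trading"]


def binary_relevance_labels(contract_labels, categories=CATEGORIES):
    """Map {address: set of categories} to {category: {address: 0 or 1}}."""
    return {
        category: {
            address: int(category in labels)
            for address, labels in contract_labels.items()
        }
        for category in categories
    }


example = {
    "0xaaa": {"auction", "entity management"},  # positive for two classes
    "0xbbb": {"voting"},
    "0xccc": set(),  # negative for every class
}
per_class = binary_relevance_labels(example)
print(per_class["auction"])
```
      </preformat>
      <p>Each per-class dictionary can then be paired with its own data representation and classifier, which is what makes it straightforward to add further categories later.</p>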
      <sec id="sec-3-1">
        <title>Data Collection &amp; Preprocessing</title>
        <p>The public crypto_ethereum dataset in Google BigQuery<sup>3</sup>, which is updated
daily through a fully synced node, is used to collect the bytecode of all smart
contracts residing on Ethereum together with the corresponding addresses. Out of
16 426 716 smart contracts deployed to Ethereum by the data collection day,
6 January 2020, only 283 898 have a unique bytecode.
After querying etherscan.io, a well-known block explorer, 84 909 of these contracts
were found to have published source code, while the source code of 198 989 contracts
remained unknown. Finally, the opcode representation of each smart contract
bytecode is obtained using evmdasm<sup>4</sup>, a Python-based EVM bytecode disassembler,
and the function names are extracted from the source code to ease
the dataset creation process.</p>
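        <p>To make the bytecode-to-opcode translation concrete, the following Python sketch disassembles a hex string with a deliberately tiny opcode table. It is not the evmdasm implementation; the function name disassemble, the partial table, and the UNKNOWN placeholder token are assumptions made for illustration.</p>
        <preformat>
```python
# Partial EVM opcode table; a real disassembler covers the full instruction set.
OPCODES = {
    0x00: "STOP", 0x01: "ADD", 0x15: "ISZERO", 0x34: "CALLVALUE",
    0x52: "MSTORE", 0x56: "JUMP", 0x57: "JUMPI", 0x5B: "JUMPDEST",
    0xF3: "RETURN", 0xFD: "REVERT",
}
OPCODES.update({0x60 + i: f"PUSH{i + 1}" for i in range(32)})  # PUSH1..PUSH32
OPCODES.update({0x80 + i: f"DUP{i + 1}" for i in range(16)})   # DUP1..DUP16
OPCODES.update({0x90 + i: f"SWAP{i + 1}" for i in range(16)})  # SWAP1..SWAP16


def disassemble(bytecode_hex):
    """Return the list of instruction mnemonics, skipping PUSH operands."""
    hex_body = bytecode_hex[2:] if bytecode_hex.startswith("0x") else bytecode_hex
    code = bytes.fromhex(hex_body)
    mnemonics, pc = [], 0
    while len(code) > pc:
        byte = code[pc]
        # Bytes missing from the table are kept as a placeholder token here.
        mnemonics.append(OPCODES.get(byte, "UNKNOWN"))
        pc += 1
        if 0x7F >= byte >= 0x60:  # PUSHn is followed by n operand bytes
            pc += byte - 0x5F
    return mnemonics


print(disassemble("0x6080604052"))
```
        </preformat>
        <p>The resulting mnemonic list plays the role of a word sequence in the later text classification steps. This sketch keeps unrecognized bytes as a placeholder token, whereas the study discarded unknown commands during translation.</p>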
      </sec>
      <sec id="sec-3-2">
        <title>Dataset Creation</title>
        <p>The class-specific datasets, which are later used for training and testing, are
created from the smart contracts with available source code. Because of the
time constraints of the study and the high number of smart contracts to
label, a filtering process based on domain keywords is applied to the
extracted function names to obtain potential positive and negative examples for a
class dataset. Since the number of potential positives is manageable, we inspect their
source code and label them manually according to the class descriptions.
However, the high number of potential negatives prevents a manual
investigation of their source code. Thus, the filter keywords for
potential negatives are extended via recursive analysis of randomly sampled
function names, and the filtering is kept as strict as possible to avoid
mislabeled examples in the dataset.</p>
        <p><sup>3</sup> https://cloud.google.com/bigquery; <sup>4</sup> https://github.com/ethereum/evmdasm</p>
        <p>To better illustrate the choice of filtering keywords, consider the Voting class.
The contracts that include the directly use-case-related keywords vote or
ballot in their function names are picked as potential positives. After
randomly sampling function names, the potential positives are extended to include the smart
contracts whose function names contain both support and against. For the potential
negatives, a set of negative filtering keywords is first created through random
function name sampling to find words related to the voting use case, and the contracts that
contain none of the keywords vote, ballot, voting, proposal, support, and
against are chosen as potential negatives.</p>
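        <p>The Voting filter described above can be sketched in Python as follows. The case-insensitive substring matching and the names voting_candidate and contains are assumptions made for illustration; the text does not state how keywords were matched against function names.</p>
        <preformat>
```python
POSITIVE_KEYWORDS = ("vote", "ballot")
PAIRED_KEYWORDS = ("support", "against")  # must occur together
NEGATIVE_BLOCKLIST = ("vote", "ballot", "voting", "proposal", "support", "against")


def contains(function_names, keyword):
    """True if any function name contains the keyword (case-insensitive)."""
    return any(keyword in name.lower() for name in function_names)


def voting_candidate(function_names):
    """Return 'positive', 'negative', or None (left out of the dataset)."""
    if any(contains(function_names, k) for k in POSITIVE_KEYWORDS):
        return "positive"
    if all(contains(function_names, k) for k in PAIRED_KEYWORDS):
        return "positive"
    if not any(contains(function_names, k) for k in NEGATIVE_BLOCKLIST):
        return "negative"
    return None  # ambiguous contracts are dropped by the strict filter


print(voting_candidate(["castVote", "getWinner"]))
print(voting_candidate(["transfer", "balanceOf"]))
print(voting_candidate(["submitProposal"]))
```
        </preformat>
        <p>Contracts flagged as potential positives would still be labeled manually from their source code, while contracts for which the function returns None are left out, reflecting the strict filtering.</p>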
        <p>In the end, the positive and negative examples are merged to create a dataset
that represents the class well enough. A summary of this process can be seen
in Fig. 1. Since a different dataset is obtained for each class, it is possible
to find examples that are positive for one class but negative for another, or even
positive for multiple classes, e.g., some gaming contracts that implement both entity
management and auctioning.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Model Selection</title>
        <p>
          Initially, the class dataset is split into a larger training&amp;validation set that is
used for model training and a smaller test set that is used to determine the generalization
ability of the final model. The opcodes of the smart contracts in the
training&amp;validation set are transformed into a feature space using three different
vector representations: a count vector (e.g., [<xref ref-type="bibr" rid="ref9">9</xref>]), which consists of the
instruction frequencies in an opcode sequence; a tf-idf (term frequency-inverse document
frequency) vector, which holds the tf-idf
weights of the instructions in an opcode sequence; and an n-gram tf-idf vector (e.g., [<xref ref-type="bibr" rid="ref27">27</xref>]),
which holds the tf-idf weights of
n successive instructions in an opcode sequence. The default n-gram range
in the initial experiments is set to (9-10), as our
preliminary experiments found it suitable. Afterward, k-fold cross-validation experiments
are conducted on the training&amp;validation set to obtain an average score
for each of the five chosen algorithms (Naïve Bayes, Logistic Regression,
SVM, Random Forest, and Xgboost) coupled with each of these vector
representations. Although several evaluation metrics suit highly
imbalanced datasets, the Matthews Correlation Coefficient (MCC) is chosen as the
comparison and final evaluation metric because it takes the true
negatives into account [<xref ref-type="bibr" rid="ref7">7</xref>], and in our case identifying both classes correctly is important. The
definition of MCC is given in Equation 1.
        </p>
        <p>
          <disp-formula>
            <label>(1)</label>
            <tex-math>\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}</tex-math>
          </disp-formula>
        </p>
        <p>where TP, FP, TN, and FN indicate the number of true positive, false positive,
true negative, and false negative predictions on the test set, respectively. MCC takes
a value between -1, which indicates complete disagreement between the real
values and the predictions, and 1, which corresponds to a perfect classifier. A random
classifier is expected to take a value around 0.</p>
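        <p>As a small worked example of the MCC definition, the following Python function computes the coefficient from the four confusion-matrix counts. Returning 0 when the denominator vanishes is a common convention and an assumption here, not something stated in the text.</p>
        <preformat>
```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


print(mcc(tp=5, tn=5, fp=0, fn=0))  # every prediction correct
print(mcc(tp=0, tn=0, fp=5, fn=5))  # every prediction wrong
print(mcc(tp=2, tn=2, fp=2, fn=2))  # no better than chance
```
        </preformat>
        <p>The three calls illustrate the extremes and the midpoint of the range: a perfect classifier, complete disagreement, and chance-level predictions.</p>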
        <p>The highest-scoring combination of algorithm and data representation is then selected for
additional improvements. A summary of the model selection phase can be seen
in Fig. 2.</p>
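        <p>The n-gram tf-idf representation used during model selection can be illustrated with a small, dependency-free Python sketch. It uses the plain tf times log(N/df) weighting; library implementations such as scikit-learn's TfidfVectorizer apply smoothing and normalization by default, so their values differ. The function names and the short n-gram range of (1-2) are assumptions chosen to keep the example readable, whereas the study starts from a range of (9-10).</p>
        <preformat>
```python
import math
from collections import Counter


def opcode_ngrams(opcodes, n_min, n_max):
    """All runs of n successive instructions for n_min up to n_max."""
    return [
        " ".join(opcodes[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(opcodes) - n + 1)
    ]


def ngram_tfidf(documents, n_min=1, n_max=2):
    """Return one {ngram: weight} dict per opcode sequence (plain tf * idf)."""
    counts = [Counter(opcode_ngrams(doc, n_min, n_max)) for doc in documents]
    doc_freq = Counter(gram for c in counts for gram in c)
    n_docs = len(documents)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            gram: (freq / total) * math.log(n_docs / doc_freq[gram])
            for gram, freq in c.items()
        })
    return vectors


docs = [
    ["PUSH1", "PUSH1", "MSTORE"],
    ["PUSH1", "CALLVALUE", "ISZERO"],
]
vectors = ngram_tfidf(docs, 1, 2)
print(sorted(vectors[0]))
```
        </preformat>
        <p>In this toy corpus, PUSH1 occurs in both sequences and therefore receives a weight of zero, while an n-gram that occurs in only one sequence, such as "PUSH1 MSTORE", receives a positive weight. This is the property that lets instruction runs specific to a class stand out.</p>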
      </sec>
      <sec id="sec-3-4">
        <title>Model Improvement</title>
        <p>Two potential points for enhancement have been identified with regard to the
selected algorithm and data representation. The first one is applicable only if the n-gram
tf-idf vector is found to be the highest-scoring data representation and consists of trying out
different n-gram ranges. It follows the same methodology as
the model selection step: the range is increased or decreased depending on
how much the change affects the average score obtained via k-fold
cross-validation. As higher n values drastically
increase the computational cost, the range is kept as small as possible if the increase
in the scores is not meaningful. A summary of this step can be seen in Fig. 3.</p>
        <p>
          The second enhancement addresses the class imbalance problem. It tests different
resampling ratios for positive and negative examples within the training set and
compares the results obtained on the validation set, once the training&amp;validation
set has been split into a separate training set and validation set. For resampling,
either the number of positive examples is increased by adding copies of randomly chosen
positive examples, or the number of negative examples is decreased by removing
randomly chosen ones, to improve the ratio of positive to negative examples
[<xref ref-type="bibr" rid="ref10">10</xref>]. However, a drastic change is avoided to prevent overfitting or loss of
information. At the end of this process, if the test scores obtained by applying
the model to the test set are consistent with the best validation scores, the
final model is applied to all smart contracts to obtain the final predictions of
membership in the trained class. A summary of this step can be seen in Fig. 4.
        </p>
        <p>
          The dataset creation step produces class datasets with different distributions, which
are depicted in Table 1. The Track&amp;Trace class is excluded from the study and
left for future work, as no representative domain words for filtering were
found. For the remaining classes, a high imbalance in favor of negative examples
can be observed.
        </p>
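        <p>The random over- and undersampling used in the second improvement step can be sketched in Python as follows. The function name resample, the target_ratio parameter, and the example sizes are hypothetical; the text does not report the ratios that were tested. In line with the description, resampling would be applied to the training set only.</p>
        <preformat>
```python
import random


def resample(positives, negatives, target_ratio, mode="oversample", seed=0):
    """Move the positive-to-negative ratio toward target_ratio.

    oversample: add copies of randomly chosen positives.
    undersample: remove randomly chosen negatives.
    """
    rng = random.Random(seed)
    positives, negatives = list(positives), list(negatives)
    if mode == "oversample":
        wanted = int(target_ratio * len(negatives))
        extra = max(0, wanted - len(positives))
        positives += rng.choices(positives, k=extra)
    elif mode == "undersample":
        wanted = int(len(positives) / target_ratio)
        negatives = rng.sample(negatives, min(wanted, len(negatives)))
    else:
        raise ValueError("mode must be 'oversample' or 'undersample'")
    return positives, negatives


# Hypothetical training set: 10 positives and 100 negatives.
pos, neg = resample(range(10), range(100, 200), 0.25, mode="oversample")
print(len(pos), len(neg))
pos, neg = resample(range(10), range(100, 200), 0.25, mode="undersample")
print(len(pos), len(neg))
```
        </preformat>
        <p>Keeping target_ratio moderate mirrors the caution expressed in the text: heavy oversampling duplicates the same few positives many times, which risks overfitting, while heavy undersampling discards information from the negatives.</p>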
        <p>The three highest-scoring combinations of algorithm and data representation are depicted
in Table 2. While for some classes, such as Voting and Renting, the chosen method
and data representation outperform the other options by a large margin, the score
difference between the top models is smaller for the other classes. Overall,
the success of decision-tree-based algorithms compared to the other
tested algorithms becomes apparent. Furthermore, the n-gram tf-idf vector is found
to be the most successful data structure for representing the smart contract opcode.</p>
        <p>
          Table 3 shows, for each class, the algorithm, the data representation, and the
corresponding average or final MCC score after each step of the
learning process. Model Improvement I and Model Improvement
II correspond to the testing of different n-gram ranges and resampling ratios,
respectively. If no improvement is observed for a step, no MCC score is given.
It can be observed that the effect of the model improvement
phase is comparatively small when the overall scores are considered. Finally, since
there were no prior studies based on the same dataset, we compare our results
to those of a random predictor that randomly
assigns 0 or 1 for class membership. The comparison of the MCC scores of the
final models and the random model on the test set shows that the classifiers
trained on the opcode representations of the smart contract bytecode outperform
the random model for each class, which is promising for further applications on
smart contract bytecode. Although almost all the classifiers seem to do
a decent job, the underperformance of the
Renting classifier needs further investigation; the considerably lower number of positive examples in
its dataset might be a factor. Further details about the methodology and results
can be found in [<xref ref-type="bibr" rid="ref21">21</xref>].
        </p>
      </sec>
      <sec id="sec-3-5">
        <title>Threats to Validity</title>
        <p>Two important issues have been found that might pose a threat to the
validity of the test outcomes. First, while creating the class datasets, the filtering is
kept very strict to remove possible examples with false labels. For example,
any smart contract with a buy function is discarded from the Trading class. This
might make the class dataset unrepresentative of the full variety of smart contracts
and their applications. Hence, the results obtained on the test set, which is
also derived from the class dataset, might not adequately show how well the model
generalizes. Second, the
disassembler used to obtain the opcode representations of the smart contracts might be
problematic. Some of the commands in the smart contract bytecode were unknown and were discarded
during the bytecode-to-opcode translation, and the effects of this are not explored in our
study.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion &amp; Future Work</title>
      <p>So far, the classification of smart contracts with respect to application patterns
has been based on code analysis. We employed text analytics to analyze smart
contracts at the bytecode level, which has allowed us to derive different
smart contract use cases. This approach is easy to implement because it has few
processing stages, and it is applicable to a larger amount of data because it does not
depend on source code once it is trained. If enough data is available for training,
the final test results show a good classification rate, which makes it promising
to treat smart contract opcode like text in future applications.</p>
      <p>For future work, we plan to extend our approach with a better labeling
process for the smart contracts in order to tackle the representativeness problem
of the class datasets. Furthermore, we intend to investigate the effect of
applying resampling earlier in the learning process as well as the effects of
other potential resampling techniques. Considering the final success of n-gram
tf-idf vectors, another interesting investigation would be the performance
of different algorithms with different n-gram ranges. Finally, we aim to improve the
feature extraction with the help of additional smart-contract-related
information.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Ethereum. https://ethereum.org/, accessed: 2020-10-25.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Aggarwal</surname>, <given-names>C.C.</given-names></string-name>,
          <string-name><surname>Zhai</surname>, <given-names>C.</given-names></string-name>:
          <article-title>A survey of text classification algorithms</article-title>.
          In: <source>Mining Text Data</source>
          (<year>2012</year>)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Bartoletti</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Pompianu</surname>, <given-names>L.</given-names></string-name>:
          <article-title>An empirical analysis of smart contracts: platforms, applications, and design patterns</article-title>.
          <source>CoRR</source> abs/1703.06322
          (<year>2017</year>), http://arxiv.org/abs/1703.06322
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Bouaziz</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Dartigues-Pallez</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>da Costa Pereira</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Precioso</surname>, <given-names>F.</given-names></string-name>,
          <string-name><surname>Lloret</surname>, <given-names>P.</given-names></string-name>:
          <article-title>Short text classification using semantic random forest</article-title>.
          In: <string-name><surname>Bellatreche</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Mohania</surname>, <given-names>M.K.</given-names></string-name> (eds.)
          <source>Data Warehousing and Knowledge Discovery</source>.
          pp. <fpage>288</fpage>–<lpage>299</lpage>.
          <publisher-name>Springer International Publishing</publisher-name>,
          <publisher-loc>Cham</publisher-loc>
          (<year>2014</year>)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>Chen</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Guestrin</surname>, <given-names>C.</given-names></string-name>:
          <article-title>Xgboost: A scalable tree boosting system</article-title>.
          In: <source>Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>.
          pp. <fpage>785</fpage>–<lpage>794</lpage>.
          KDD '16,
          <publisher-name>Association for Computing Machinery</publisher-name>,
          <publisher-loc>New York, NY, USA</publisher-loc>
          (<year>2016</year>). https://doi.org/10.1145/2939672.2939785
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>Chen</surname>, <given-names>W.</given-names></string-name>,
          <string-name><surname>Zheng</surname>, <given-names>Z.</given-names></string-name>,
          <string-name><surname>Ngai</surname>, <given-names>E.</given-names></string-name>,
          <string-name><surname>Zheng</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Zhou</surname>, <given-names>Y.</given-names></string-name>:
          <article-title>Exploiting blockchain data to detect smart ponzi schemes on ethereum</article-title>.
          <source>IEEE Access</source> <volume>PP</volume>,
          <fpage>1</fpage>–<lpage>1</lpage>
          (<month>03</month> <year>2019</year>). https://doi.org/10.1109/ACCESS.2019.2905769
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><surname>Chicco</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Jurman</surname>, <given-names>G.</given-names></string-name>:
          <article-title>The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation</article-title>.
          <source>BMC Genomics</source> <volume>21</volume>
          (<month>1</month> <year>2020</year>). https://doi.org/10.1186/s12864-019-6413-7
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><surname>Colas</surname>, <given-names>F.</given-names></string-name>,
          <string-name><surname>Brazdil</surname>, <given-names>P.</given-names></string-name>:
          <article-title>Comparison of svm and some older classification algorithms in text classification tasks</article-title>.
          In: <string-name><surname>Bramer</surname>, <given-names>M.</given-names></string-name> (ed.)
          <source>Artificial Intelligence in Theory and Practice</source>.
          pp. <fpage>169</fpage>–<lpage>178</lpage>.
          <publisher-name>Springer US</publisher-name>,
          <publisher-loc>Boston, MA</publisher-loc>
          (<year>2006</year>)
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name><surname>Erk</surname>, <given-names>K.</given-names></string-name>:
          <article-title>Vector space models of word meaning and phrase meaning: A survey</article-title>.
          <source>Language and Linguistics Compass</source>
          <volume>6</volume>(<issue>10</issue>),
          <fpage>635</fpage>–<lpage>653</lpage>
          (<year>2012</year>). https://doi.org/10.1002/lnco.362, https://onlinelibrary.wiley.com/doi/abs/10.1002/lnco.362
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          :
          <article-title>Learning from imbalanced data</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>21</volume>
          (
          <issue>9</issue>
          ),
          <fpage>1263</fpage>
          –
          <lpage>1284</lpage>
          (
          <year>2009</year>
          ). https://doi.org/10.1109/TKDE.2008.239
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Characterizing code clones in the Ethereum smart contract ecosystem</article-title>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Ifrim</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bakir</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Fast logistic regression for text categorization with variable-length n-grams</article-title>
          .
          <publisher-name>Association for Computing Machinery</publisher-name>
          , New York, NY, USA (
          <year>2008</year>
          ). https://doi.org/10.1145/1401890.1401936
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>K</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joseph</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Text classification by augmenting bag of words (BoW) representation with co-occurrence feature</article-title>
          .
          <source>IOSR Journal of Computer Engineering</source>
          <volume>16</volume>
          ,
          <fpage>34</fpage>
          –
          <lpage>38</lpage>
          (01
          <year>2014</year>
          ). https://doi.org/10.9790/0661-16153438
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Smart Contract Design Patterns to assist Blockchain Conceptualization</article-title>
          .
          <source>Master's thesis</source>
          , University of Cologne, Cologne (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Kowsari</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meimandi</surname>
            ,
            <given-names>K.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heidarysafa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barnes</surname>
            ,
            <given-names>L.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brown</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          :
          <article-title>Text classification algorithms: A survey</article-title>
          . arXiv abs/1904.08067 (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Learning with positive and unlabeled examples using weighted logistic regression</article-title>
          . vol.
          <volume>20</volume>
          , pp.
          <fpage>448</fpage>
          –
          <lpage>455</lpage>
          (01
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Efficient estimation of word representations in vector space</article-title>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Norvill</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fiz Pontiveros</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>State</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Awan</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cullen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Automated labeling of unknown contracts in Ethereum</article-title>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>6</lpage>
          (07
          <year>2017</year>
          ). https://doi.org/10.1109/ICCCN.2017.8038513
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Understanding inverse document frequency: On theoretical arguments for IDF</article-title>
          .
          <source>Journal of Documentation</source>
          <volume>60</volume>
          ,
          <fpage>503</fpage>
          –
          <lpage>520</lpage>
          (10
          <year>2004</year>
          ). https://doi.org/10.1108/00220410410560582
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Sang-Bum</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>Kyoung-Soo</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rim</surname>
            ,
            <given-names>Hae-Chang</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Myaeng</surname>
            ,
            <given-names>Sung Hyon</given-names>
          </string-name>
          :
          <article-title>Some effective techniques for naive Bayes text classification</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>18</volume>
          (
          <issue>11</issue>
          ),
          <fpage>1457</fpage>
          –
          <lpage>1466</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Sezer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Automated Classification of Smart Contracts on Ethereum</article-title>
          .
          <source>Master's thesis</source>
          , RWTH Aachen University, Aachen (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Tann</surname>
            ,
            <given-names>W.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>X.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>S.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ong</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Towards safer smart contracts: A sequence learning approach to detecting vulnerabilities</article-title>
          . CoRR abs/1811.06632 (
          <year>2018</year>
          ), http://arxiv.org/abs/1811.06632
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Tian</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lv</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Smart contract classification with a Bi-LSTM based approach</article-title>
          .
          <source>IEEE Access</source>
          <volume>8</volume>
          ,
          <fpage>43806</fpage>
          –
          <lpage>43816</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24. Wohrer,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Zdun</surname>
          </string-name>
          ,
          <string-name>
            <surname>U.</surname>
          </string-name>
          :
          <article-title>Design patterns for smart contracts in the ethereum ecosystem</article-title>
          .
          <source>In: 2018 IEEE International Conference on Internet of Things (iThings)</source>
          and
          <article-title>IEEE Green Computing and Communications (GreenCom) and</article-title>
          IEEE Cyber,
          <article-title>Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData)</article-title>
          . pp.
          <volume>1513</volume>
          {
          <issue>1520</issue>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
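          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>An improved random forest classifier for text categorization</article-title>
          .
          <source>JCP</source>
          <volume>7</volume>
          ,
          <fpage>2913</fpage>
          –
          <lpage>2920</lpage>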
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>A re-examination of text categorization methods</article-title>
          . In:
          <source>Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          . pp.
          <fpage>42</fpage>
          –
          <lpage>49</lpage>
          . SIGIR '99,
          <publisher-name>Association for Computing Machinery</publisher-name>
          , New York, NY, USA (
          <year>1999</year>
          ). https://doi.org/10.1145/312624.312647
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mercaldo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martinelli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sangaiah</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          :
          <article-title>Classification of ransomware families with machine learning based on n-gram of opcodes</article-title>
          .
          <source>Future Generation Computer Systems</source>
          <volume>90</volume>
          ,
          <fpage>211</fpage>
          –
          <lpage>221</lpage>
          (
          <year>2019</year>
          ). https://doi.org/10.1016/j.future.2018.07.052, http://www.sciencedirect.com/science/article/pii/S0167739X18307325
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>