Building of Regression Models for Cryptocurrency Price Prediction

Kirill Smelyakov 1, Oleksandr Bizkrovnyi 1, Natalia Sharonova 2, Serhii Smelyakov 1, Anastasiya Chupryna 1

1 Kharkiv National University of Radio Electronics, 14 Nauky Ave., Kharkiv, 61166, Ukraine
2 National Technical University "KhPI", Kyrpychova str. 2, Kharkiv, 61002, Ukraine

Abstract
This article investigates the factors that can affect the cryptocurrency price and their usage in regression models, in order to determine which model type, and which algorithm in particular, is best suited for predicting the crypto price. The determination of the best algorithm is based on an experiment that includes training and validation of the models; a comparative analysis of the models' validation results identifies the best-suited algorithm. The type of cryptocurrency analyzed is DeFi, namely Ethereum; the study is based on a one-year time frame; the paper does not consider political factors or factors of infrastructure destruction that may affect cryptocurrency prices. The factor types used to create the regression models are fundamental factors; technical factors were omitted and can be investigated in other works. The factors include network statistics, exchange statistics, mining statistics, social statistics, transaction data, etc. The models' performance is calculated by regression metrics, and JMH is used to measure the models' training time.

Keywords
Cryptocurrency, machine learning, price forecasting, prediction model, impacting factors for cryptocurrency price

1. Introduction
Cryptocurrency, and Bitcoin in particular, has demonstrated its value in recent years, and there are now 14 million bitcoins in circulation. Investors speculating on the future possibilities of this new technology have provided much of the current market capitalization, and this will likely continue until a certain degree of price stability and market acceptance is achieved. Beyond the announced price of a cryptocurrency, those who invest in it rely on its perceived "intrinsic value". This includes the technology itself and the network, the integrity of the cryptographic code and the decentralized network. Blockchain public ledger technology (the technology underlying cryptocurrency) is capable of disrupting a range of transactions beyond the traditional payment system. These include stocks, bonds and other financial assets whose records are stored digitally and for which there is currently a need for a trusted third party to validate the transaction.
At present, a huge number of models, algorithms and technologies have been developed to improve the speed of fraud detection, mining efficiency, cybersecurity and privacy, as well as to improve the efficiency of forecasting prices, volatility, portfolio volume and structure, etc. At the same time, the algorithms that solve these problems are often unsustainable because they do not take into account a number of important influencing factors. In this regard, it is now relevant to perform a deeper analysis of the factors that have an impact on cryptocurrency price formation in order to build regression models that predict prices.

COLINS-2022: 6th International Conference on Computational Linguistics and Intelligent Systems, May 12–13, 2022, Gliwice, Poland
EMAIL: kyrylo.smelyakov@nure.ua (K. Smelyakov); oleksandr.byzkrovnyi@nure.ua (O. Bizkrovnyi); nvsharonova@ukr.net (N. Sharonova); serhii.smeliakov@nure.ua (S. Smelyakov); anastasiya.chupryna@nure.ua (A. Chupryna)
ORCID: 0000-0001-9938-5489 (K. Smelyakov);
0000-0001-9335-442X (O. Bizkrovnyi); 0000-0002-8161-552X (N. Sharonova); 0000-0002-5791-2479 (S. Smelyakov); 0000-0003-0394-9900 (A. Chupryna)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

2. Related Works
A cryptocurrency is software with a specific usage model that allows it to be included in the currency market and traded. The papers [1-3] present a modern review of cryptocurrency systems, models and algorithms. In particular, work [1] contains a comparative analysis of mining algorithms; in [2, 3] the main challenges and opportunities are formulated, together with an analysis of the basic artificial intelligence methods used to solve the most important cryptocurrency problems related to price forecasting, risks, cybersecurity threats and a number of others.
The main idea of the "coin" type of cryptocurrency is the ability to prepare anonymous transactions [4]; work [5] shows the features of a model of a decentralized confidential payment system. Furthermore (a major restriction regarding the type of cryptocurrency), other types of cryptocurrencies exist, but this research is oriented only toward investigating the factors that impact the "coin" type of cryptocurrency [6-8].
All cryptocurrencies are software, and their price is the result of a complex interplay of many factors [9]. Globally, there are two groups of factors: the product itself (the cryptocurrency as a product and its mechanism) and the trading market. Nothing lives outside its environment, and crypto is not an exception. The relationships between the choice of factors, models and algorithms of the blockchain cryptocurrency ecosystem functioning are described in papers [10-12].
Crypto is ordinary software, so general rules that affect any product in this area can affect a cryptocurrency too. Examples of such rules are described in [13-15]: each product has competitors; each product should offer specific features to survive; each product depends on the buying ability of potential customers, etc.
A cryptocurrency is specific software with a peculiar working mechanism. In general, every cryptocurrency has users and transaction validators; the validator role is played by miners, who mine each next block of the blockchain. As a result, there are two roles whose needs have to be addressed. If users do not have the ability to use crypto coins, the cryptocurrency will die [16]. On the other hand, if miners do not earn enough profit to cover all of their costs, transactions will be approved with huge delays, which decreases the popularity of the cryptocurrency and, as a result, may become the root cause of its exit from the market [17-19]. Solving these problems is decisive for the effective application of particular models of artificial intelligence, machine learning and computer vision [20, 21], including using these models to improve the efficiency of algorithms for the formation and processing of network information [22-24].
There are many approaches that use different factors to predict the crypto price. One attempt predicts the price using the GRU, LSTM and bi-LSTM machine learning algorithms [25]. This approach uses the following factors for the training data set:
• Open price;
• High price;
• Low price;
• Close price;
• Date.
This factor selection does not inspire enough confidence, because price factors do not capture the root cause of their values on a particular date; in other words, the factors are not descriptive. Despite this, the model validation process shows the following results (Figure 1, Figure 2). The article does not describe how the models were trained and validated, but historic price values alone can never be used for future price prediction.
Another work [26] uses technical metrics of the cryptocurrency for price prediction. All of the article's frameworks attempt to predict Bitcoin prices starting from five technical indicators:
• Simple Moving Average (SMA);
• Exponential Moving Average (EMA);
• Momentum (MOM);
• Moving Average Convergence Divergence (MACD);
• Relative Strength Index (RSI).
One of the ML frameworks is described below (Figure 3).
Figure 1: Actual and predicted price of BTC using the LSTM model [25]
Figure 2: Actual and predicted price of BTC using the GRU model [25]
Figure 3: Architecture of the one-stage framework [26]
Technical indicators also cannot be a comprehensive data source for ML model training, because technical analysis does not live without fundamental analysis, which includes network statistics, exchange statistics, the worldwide economic state, etc.
The main goal of this investigation is to determine the informative factors that can be used for price forecasting, to determine the effectiveness of using regression machine learning models [27], and to figure out the best-suited algorithm for the mentioned problem.

3. Methods and Materials
Consider the initial data for the methods and experiments, the metrics, the factors and the methods proposed to solve the problem under consideration.

3.1. Data Description
The problem is characterized by time series data, because crypto metrics and worldwide economic indices are updated every day. The dataset includes only worldwide economic data and metrics for a particular cryptocurrency. Data split by country is excluded: the crypto price is worldwide, so no per-country data can be used in the dataset.
The sources used for the dataset do not cover the same dates. While regular exchanges work only 5 days a week during business hours, crypto exchanges work every day around the clock. This fact leads to a need for data preparation: the values recorded at the end of a business week are reused for the weekends, when crypto exchanges keep trading.
Another data issue that was found is missing data in some places. It is resolved by deleting the whole row, to avoid creating incorrect relationships between the factors. If the dataset were large, recovering the gaps would be possible, but when the dataset is small, each data row is important for creating valid relationships.
The dataset cannot be found on the Internet in public access. It consists of different parts that are retrieved from different sources in CSV format and then combined using Spark; a sketch of this join-and-clean step is given below.
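As an illustration, a minimal sketch of such a join-and-clean step with the Spark Java API is shown below. The file names and the set of sources are hypothetical stand-ins, not the actual source layout; the shared "Date" column follows the description above.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DatasetBuilder {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("eth-dataset-builder")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical CSV sources, each keyed by a shared "Date" column;
        // weekend gaps in the exchange-index files are assumed to be
        // pre-filled with the last business-day value, as described above.
        Dataset<Row> price  = readCsv(spark, "data/eth_price.csv");
        Dataset<Row> dji    = readCsv(spark, "data/dji_global.csv");
        Dataset<Row> inflow = readCsv(spark, "data/eth_inflow.csv");

        // Inner joins keep a day only if every source has a value for it,
        // and na().drop() removes any remaining rows with missing data,
        // matching the row-deletion strategy chosen for the small dataset.
        Dataset<Row> combined = price
                .join(dji, "Date")
                .join(inflow, "Date")
                .na().drop();

        combined.write().option("header", "true").csv("data/combined");
        spark.stop();
    }

    private static Dataset<Row> readCsv(SparkSession spark, String path) {
        return spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv(path);
    }
}
```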
Here are examples of the data that form the dataset (Table 1, Table 2).

Table 1
DJI index data

Date          Value
01.01.2021    495.15
02.01.2021    495.34
03.01.2021    495.15
04.01.2021    492.8
05.01.2021    495.88
06.01.2021    498.4

Table 2
Ethereum Inflow Exchanges data

Date/Time     Aggregated Exchanges    Price
01.01.2021    434993.6207041356       733.425
02.01.2021    609389.5356033011       752.49
03.01.2021    1441436.1960446832      890.94
04.01.2021    1793421.0470202232      1026.57
05.01.2021    1081812.942707662       1054.795
06.01.2021    1117149.8606023395      1136.655

The "Price" column is not used when forming the dataset, because a separate CSV file with the crypto price and date exists. The dataset is available at the following link [28].
About the data sources. There are a huge number of ways to retrieve particular information about a particular cryptocurrency. For example, to retrieve network and mining data, users can run nodes for a particular cryptocurrency and aggregate the required information; this way is time- and resource-consuming. Social media data can be retrieved directly from the relevant portals, such as GitHub, Telegram and Twitter, but these are raw data that require preparation to extract sentiment. In general, dedicated services help avoid such resource and time costs. "IntoTheBlock" was selected as the source data provider, because this system allows a 7-day trial and already has all of the precomputed and aggregated data mentioned above. Worldwide economic information for the dataset can be retrieved from the following resource [29].

3.2. Informative Factors' Selection
The global factors mentioned above give an understanding of which of them may signal the price direction. The following list of factors from different areas was selected to build the ML models.

3.2.1. Worldwide Data
There are factors that are not directly related to cryptocurrencies but allegedly affect them. These are global economic factors that demonstrate global economic behavior, which may affect the demand for cryptocurrency.
S&P Global 1200 – the factor reflects the global economic situation based on the indexes of the 1200 biggest companies. It was included as an index of investors' buying ability: if the world economic situation improves, more investors may spend funds on such volatile investments as cryptocurrency.
Dow Jones Global – the factor reflects the economic state of industrial companies. It can be used in the same way as the previous one.

3.2.2. Ethereum Data
The following factors are closely related to cryptocurrencies and their working principles. In general, these metrics show different aspects of a cryptocurrency during its lifecycle.
Inflow volume – the total amount (in $ or tokens) entering exchanges' deposit wallets; "all exchanges" refers to all supported exchanges. Sharp jumps of inflows tend to coincide with, and sometimes precede, periods of high volatility. This can potentially be interpreted as a sign of holders looking to sell on centralized exchanges.
Outflow volume – the total amount (in $ or tokens) leaving exchanges' withdrawal wallets; "all exchanges" refers to all supported exchanges. Outflow volume often spikes following either a crash or a significant break-out. This can potentially be interpreted as users going long and opting to hold their crypto outside centralized exchanges.
ETH price – the dependent variable in the regression model; the Ethereum price.
ETH–BTC correlation – the factor displays the correlation between the prices of the largest cryptocurrencies. If one product loses buyers' confidence, other products in the same sphere may lose it too.
Large transactions – the indicator shows transactions in which an amount greater than $100,000 was transferred.
Large transactions volume in USD – measures the aggregate dollar amount transferred in transactions greater than $100,000. This metric shows the total amount transacted by whale players in a given day and may give an idea of upcoming changes in the cryptocurrency market when huge crypto volumes are transferred between addresses.
Transaction count – the indicator displays activity in blockchain networks, which can show the general market behavior. If the transaction count increases, the popularity of the industry or of a particular cryptocurrency is rising.
Miners' inflows – the indicator may point to general miner activity and how much miners earn. A huge miner inflow can mean an increased need for miners.
Miners' outflows – the indicator may point to miner behavior, when they sell their crypto holdings on exchanges.
Miners' reward – the metric describes the miners' reward. If the reward is low, the cryptocurrency may get stuck with long transaction confirmation delays. Also, if a huge share of the reward consists of the fees users pay, the popularity of the crypto may decrease.
Average transaction fees – the metric can point to increasing cryptocurrency demand: when a huge number of transactions is sitting in the queue, customers start to pay an extra fee to speed up their transaction confirmation.
Average transaction volume – transaction volume can indicate both trading and non-speculative activity. Similar to the trading volume observed on exchanges, transaction volume can be useful for identifying reversals and breakouts.
GitHub activity – a group of indicators: opened issues, closed issues, watchers count, forks count, opened and closed pull requests count. They may give an idea of how quickly the cryptocurrency's development is growing.
Search trends – indicates how often the cryptocurrency rises into the spotlight. Increased attention may indicate upcoming market moves.
Telegram sentiment – the indicator helps measure traders' emotions. In the case of Bitcoin, positive sentiment on Telegram has on several occasions preceded a price movement, as seen in December 2019, April and June 2020. At the same time, the percentage of messages perceived as negative tends to increase during market crashes. Finally, the total number of messages is indicative of the level of activity in these group chats. It is not necessarily reflected in crypto activity, but its fluctuations are worth noting as a rough indicator of community engagement.
Twitter sentiment – a measurement of the emotions of market participants. Sometimes sentiment can be a leading indicator, as was the case with Ethereum in June and July. In most cases, however, sentiment tends to be a reactive indicator: there is more positive sentiment when prices are rising and more negative sentiment when prices are falling.

3.3. ML Model Validation and Metrics
The correctness of a created regression model is a relative value. Unlike the way validity is determined for classification problems, regression validation does not include counting "false negative" values. The regression model validity is defined by the following metrics.
1. Mean absolute error:
$MAE = \frac{1}{N} \sum_{t=1}^{N} \left| Y(t) - \hat{Y}(t) \right|$, (1)
where $N$ is the number of records in the test dataset, $\hat{Y}$ is the predicted value and $Y$ is the real value.
2. Mean square error:
$MSE = \frac{1}{N} \sum_{t=1}^{N} \left( Y(t) - \hat{Y}(t) \right)^{2}$, (2)
where $N$ is the number of records in the test dataset, $\hat{Y}$ is the predicted value and $Y$ is the real value.
3. Root mean square error:
$RMSE = \sqrt{MSE}$. (3)
4. Explained variance:
$VAR = 1 - \frac{Var(Y - \hat{Y})}{Var(Y)}$, (4)
where $\hat{Y}$ is the predicted value and $Y$ is the real value.
It was decided not to include the R² metric in the model validation metric set. Even when the same R-squared statistic is produced, the predictive validity can be rather different depending on what the true dependency is: if it is truly linear, the predictive accuracy will be quite good; otherwise, it will be much poorer. In this sense, R-squared is not a good measure of predictive error.

3.4. ML Models and Methods
There are plenty of machine learning regression algorithms to choose from. First of all, regression model selection should be based on the requirements and on the specifics of the data on which the model will be trained and validated. This research focuses on time series data, because the cryptocurrency price changes continuously, depending on the selected time frame. There are many types of regression models, but this article focuses on nonlinear models. There are a few reasons for this selection:
• Logistic regression is not suitable, because that algorithm allows only two values (1, 0) for the dependent variable, while the investigated problem has time series data;
• The relationships between the data variables explained above are not linear, because increasing one variable while decreasing another may affect the dependent variable in an unpredictable way. This means that using linear regression algorithms can lead to incorrect data fitting;
• Polynomial regression models a non-linear dataset using a linear model. It works in a similar way to multiple linear regression (which is just linear regression with multiple independent variables) but uses a non-linear curve. It is used when data points are present in a non-linear fashion [30]. This algorithm does not fit the current purpose, because there is a need to make rules or decisions instead of calculating an average between all points to "draw a line".
The best approach here is to use non-linear regression algorithms. Three representatives of the non-linear family will be used:
• Decision trees;
• Random forest;
• Gradient boosted trees.
The main function of decision tree regression is to split the dataset into smaller sets. The subsets of the dataset are created to plot the value of any data point that relates to the problem statement. Splitting the dataset with this algorithm results in a decision tree that has decision and leaf nodes. ML experts prefer this model in cases where there is not enough change in the dataset [31].
The decision tree algorithm has hyperparameters for model tuning. One of them is the tree depth. If the maximum depth of the tree is set too high, the decision trees learn overly fine details of the training data and learn from the noise, i.e. they overfit (Figure 4). As a result, an optimal tree depth is required to build a well-fitted model. The optimal depth can be found experimentally: if the regression validity metrics show good performance on the training dataset but poor performance on the test dataset, the model is overfitting and the tree depth needs to be decreased. A sketch of such a sweep is given below.
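A minimal sketch of such an experimental depth sweep using Spark ML's Java API follows. The "features" and "label" column names and the candidate depths are illustrative assumptions, not the exact procedure used in the paper.

```java
import org.apache.spark.ml.evaluation.RegressionEvaluator;
import org.apache.spark.ml.regression.DecisionTreeRegressionModel;
import org.apache.spark.ml.regression.DecisionTreeRegressor;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public final class DepthSweep {

    // Trains one tree per candidate depth and prints train/test RMSE;
    // a widening gap between the two values signals overfitting.
    public static void run(Dataset<Row> train, Dataset<Row> test) {
        RegressionEvaluator rmse = new RegressionEvaluator()
                .setLabelCol("label")           // ETH price
                .setPredictionCol("prediction")
                .setMetricName("rmse");

        for (int depth : new int[] {3, 5, 10, 20, 30}) {
            DecisionTreeRegressionModel model = new DecisionTreeRegressor()
                    .setFeaturesCol("features")
                    .setLabelCol("label")
                    .setMaxDepth(depth)
                    .fit(train);

            System.out.printf("depth=%2d  train RMSE=%.2f  test RMSE=%.2f%n",
                    depth,
                    rmse.evaluate(model.transform(train)),
                    rmse.evaluate(model.transform(test)));
        }
    }
}
```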
The random forest is also a widely used algorithm for non-linear regression in machine learning. Unlike decision tree regression (a single tree), a random forest uses multiple decision trees for predicting the output. Random data points are selected from the given dataset (say, k data points), and a decision tree is built with them by this algorithm. Several decision trees are then modeled to predict the value of any new data point. Figure 5 shows an example of how the random forest algorithm works. Since there are multiple decision trees, multiple output values will be predicted; to compute the final output, the average of all the values predicted for a new data point is taken. The only drawback of using a random forest algorithm is that it requires more input in terms of training: due to the large number of decision trees mapped under this algorithm, it requires more computational power [32].
Figure 4: Representation of what overfitting looks like [32]
Figure 5: Explanation of the random forest algorithm principle [33]
The Gradient Boosted Regression Trees (GBRT) model (also called Gradient Boosting Machine, or GBM) is one of the most effective machine learning models for predictive analytics, which makes it an industrial workhorse of machine learning. The boosted trees model is a type of additive model that makes predictions by combining decisions from a sequence of base models; for boosted trees, each base model is a simple decision tree. This broad technique of using multiple models to obtain better predictive performance is called model ensembling. Unlike random forest, which constructs all the base models independently, each using a subsample of the data, GBRT uses a particular ensembling technique called gradient boosting [34].

4. Experiment
The main goal of this experiment is the training of the selected regression models and determining which of them is the best for small amounts of data. The experiment consists of two steps:
• Experiment planning;
• Results overview.

4.1. Experiment Planning
First of all, the experiment requires the creation of the dataset. This is achieved by manually downloading the sources and building a hierarchical folder structure for convenient file access. Each of the source files has a column that describes the day when an event happened, and this column is used to merge the source files into the full dataset. The dataset is created, i.e. combined from different sources, using the capabilities of the Apache Spark framework. Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.
The next step is splitting the dataset into two parts in a 70%/30% ratio. This is required to train and test the models; a sketch of this step is given below. The spark.mllib library supports decision trees for binary and multiclass classification, as well as for regression, using both continuous and categorical features. The implementation splits the data by rows, which allows distributed training with millions of instances.
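As a sketch of the split step under the stated assumptions: the factor column names below are illustrative placeholders for the features described in Section 3.2, and the fixed seed is an assumption added for reproducibility.

```java
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public final class SplitStep {

    // Assembles the factor columns into the single vector column that
    // Spark's tree learners expect, then performs the 70/30 split.
    public static Dataset<Row>[] prepare(Dataset<Row> combined) {
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[] {
                        "sp1200", "djGlobal", "inflowVolume", "outflowVolume",
                        "transactionCount", "minersInflow", "avgFees"})
                .setOutputCol("features");

        Dataset<Row> prepared = assembler.transform(combined)
                .withColumnRenamed("ethPrice", "label");

        // A fixed seed keeps the train/test partitions reproducible.
        return prepared.randomSplit(new double[] {0.7, 0.3}, 42L);
    }
}
```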
Random forests are ensembles of decision trees and one of the most successful machine learning models for classification and regression. They combine multiple decision trees to reduce the risk of overfitting. Like decision trees, random forests handle categorical features, extend to multi-class classification, do not require feature scaling, and can account for non-linearities and feature interactions. The spark.mllib library supports random forests for binary and multi-class classification as well as regression, using both continuous and categorical features, and implements them on top of the existing decision tree implementation.
Gradient-Boosted Trees (GBTs) are ensembles of decision trees. GBTs iteratively train decision trees to minimize a loss function. Like decision trees, GBTs handle categorical features, extend to multi-class classification, do not require feature scaling, and are able to account for non-linearities and feature interactions. The spark.mllib library supports GBTs for binary classification and regression using both continuous and categorical features, implementing them on top of the existing decision tree implementation (for more information, see the decision tree guide). Note that GBTs do not yet support multi-class classification; use decision trees or random forests for multi-class problems.
After the dataset is created, the ML models will be trained and validated using the metrics mentioned above. An important note is the measurement of the training time of each model during the training process, to determine the fastest model for the proposed dataset. The next step of the experiment is the validation of the trained models, reporting the results and determining the correctness of the assumption that the selected factors have relationships with the ETH price.

4.2. ML Models Training
The data selected to train the regression models covers a one-year time frame, because that time period exhibits robust rules of market behavior for a particular cryptocurrency. The model training process recorded below uses the hyperparameters best suited for this data: decision tree – max depth 30; random forest – max depth 5; gradient boosted trees – max iterations 10, loss type "absolute", max depth 5, subsampling rate 0.4. In general, the data granularity is one day, so there are 365 examples of how the indicators affect the Ethereum price. This is a small amount of data for robust forecasting, but the experiment will provide more evidence regarding that assumption.
Java 8, the Spark framework and macOS Monterey were selected to build and validate the regression models. The hardware consists of a 2.6 GHz Quad-Core Intel Core i7 and 16 GB of 2133 MHz LPDDR3 memory. The microbenchmark measuring how long the models train was prepared with the Java JMH benchmark. JMH is a Java library for writing benchmarks on the JVM, developed as part of the OpenJDK project. JMH provides a very solid foundation for writing and executing benchmarks whose results will not be corrupted by unwanted virtual machine optimizations. The available benchmark modes are listed in Table 3; a sketch of a benchmark declaration follows the table.

Table 3
JMH benchmark modes

Name               Description
Throughput         Measures the number of operations per second, i.e. how many times per second the benchmark method can be executed.
Average Time       Measures the average time a single execution of the benchmark method takes.
Sample Time        Measures how long the benchmark method takes to execute, including max and min times, etc.
Single Shot Time   Measures how long a single execution of the benchmark method takes. This is good for testing performance under a cold start (no JVM warm-up).
All                Measures all of the above.
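A minimal sketch of how such a benchmark can be declared in Average Time mode, consistent with the setup described above; the data path, column names and caching step are assumptions, not the paper's exact harness.

```java
import java.util.concurrent.TimeUnit;

import org.apache.spark.ml.regression.RandomForestRegressor;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class TrainingTimeBenchmark {

    private Dataset<Row> train;

    @Setup
    public void setup() {
        // Hypothetical path to the prepared 70% training split.
        SparkSession spark = SparkSession.builder()
                .appName("training-benchmark")
                .master("local[*]")
                .getOrCreate();
        train = spark.read().parquet("data/train.parquet").cache();
        train.count(); // materialize the cache outside the measured code
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    public Object trainRandomForest() {
        // The max depth matches the hyperparameter used in the experiment.
        return new RandomForestRegressor()
                .setFeaturesCol("features")
                .setLabelCol("label")
                .setMaxDepth(5)
                .fit(train);
    }
}
```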
The measurements of how long the models train on the given data are shown in Table 4. The Average Time mode was selected for benchmarking. This investigation uses milliseconds as the time unit; the TimeUnit class contains the following time unit constants:
• Nanoseconds;
• Microseconds;
• Milliseconds;
• Seconds;
• Minutes;
• Hours;
• Days.

Table 4
Average algorithms' training time

Algorithm                         Time, ms
Decision tree                     5570
Random forest                     1400
Gradient boosted decision trees   5960

The experiment shows that the quickest algorithm for this amount of data is the random forest regressor.

5. Results
This part contains the results obtained during testing of the created regression models, aggregated and presented in table form.

5.1. Decision Tree Model
Table 5 presents the data retrieved during the decision tree model validation.

Table 5
Decision tree validation results

Metric                            Value
Root Mean Squared Error (RMSE)    285.94868249
Mean absolute error (MAE)         233.59134551
Mean square error (MSE)           81766.649019
Explained Variance                77061.375871

Table 6 gives an example of the output that the decision tree regression model produces. If the time consumed by model training is not a problem, this algorithm can be used.

Table 6
Decision tree prediction results

Prediction     Price
4537.324       4346.08
4216.365234    4342.58
4730.384277    4283.6
4340.763672    4059.81
3970.181885    3848.18
3970.181885    3883.93
4030.908936    3960.15
4294.453612    3869.35
4269.73291     4076.1
4088.45776     4082.56
4409.93115     4057.3
4486.243164    4079.46

5.2. Random Forest Regression Model
Table 7 presents the data retrieved during the random forest regression model validation.

Table 7
Random forest validation results

Metric                            Value
Root Mean Squared Error (RMSE)    238.085738
Mean absolute error (MAE)         158.3889740
Mean square error (MSE)           56684.8187
Explained Variance                31666.02249

Table 8 gives an example of the output that the random forest regression model produces. The error values are smaller than for the decision tree model, and the training time is lower as well.

Table 8
Random forest prediction results

Prediction       Price
4274.41063283    4346.08
4176.54161135    4342.58
4242.195490373   4283.6
4186.625616152   4059.81
4049.135305026   3848.18
3940.053935855   3883.93
3898.710454863   3960.15
3635.556222742   3869.35
4306.760494009   4283.6
4207.974993752   4082.56
4142.932871340   4057.3
4154.965331289   4079.46

5.3. Gradient Boosted Trees
Table 9 presents the data retrieved during the gradient boosted trees regression model validation.

Table 9
Gradient boosted trees validation results

Metric                            Value
Root Mean Squared Error (RMSE)    261.4591287
Mean absolute error (MAE)         223.36118231
Mean square error (MSE)           68360.875985
Explained Variance                41659.384746

Table 10 gives an example of the output that the gradient boosted trees regression model produces. This algorithm is more accurate than the decision tree, but the training time is an issue here.

Table 10
Gradient boosted trees prediction results

Prediction         Price
4374.544273549     4346.08
4374.58427354      4342.58
4374.74427354      4283.6
4281.1311509146    4059.81
4037.4451253475    3848.18
4037.609813992     3883.93
4037.622749360     3960.15
4038.030281828     3869.35
4280.553806327     4076.1
4280.3500029089    4082.56
4280.70529148001   4057.3
4651.65643435048   4079.46
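For reference, metric values like those in Tables 5, 7 and 9 can be computed from each model's test predictions with spark.mllib's RegressionMetrics; a minimal sketch is given below. The "prediction" and "label" column names are assumptions about the prepared dataframe, and the explained variance reported here is Spark's definition of that quantity.

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.evaluation.RegressionMetrics;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import scala.Tuple2;

public final class ValidationReport {

    // Computes the four validation metrics used in this paper from the
    // dataframe returned by model.transform(test).
    public static void print(Dataset<Row> predictions) {
        JavaRDD<Tuple2<Object, Object>> pairs = predictions
                .select("prediction", "label")
                .toJavaRDD()
                .map(row -> new Tuple2<>(row.get(0), row.get(1)));

        RegressionMetrics metrics = new RegressionMetrics(pairs.rdd());
        System.out.println("RMSE = " + metrics.rootMeanSquaredError());
        System.out.println("MAE  = " + metrics.meanAbsoluteError());
        System.out.println("MSE  = " + metrics.meanSquaredError());
        System.out.println("Explained variance = "
                + metrics.explainedVariance());
    }
}
```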
6. Discussions
The aggregated results in the following charts display which of the algorithms is best suited for the particular problem. All charts compare the algorithms by a particular regression accuracy metric. These charts cover the following regression validation metrics:
• Root Mean Squared Error (RMSE);
• Mean absolute error (MAE);
• Mean square error (MSE);
• Explained Variance.
At the end of this section, general conclusions regarding the usage results of the mentioned regression algorithms are drawn.
Figure 6 compares the algorithms by the RMSE metric, which measures the differences between the values predicted by a model or estimator and the observed values. The RMSE represents the square root of the second sample moment of the differences between predicted and observed values, or the quadratic mean of these differences. The smallest error is observed for the random forest algorithm, along with the smallest training time.
The next chart (Figure 7) compares the algorithms by the MAE metric, which refers to the magnitude of the difference between a prediction and the true value of an observation. MAE takes the average of the absolute errors over a group of predictions and observations as a measurement of the magnitude of errors for the entire group. MAE can also be referred to as the L1 loss function. The results are the same as before: the random forest algorithm has the highest accuracy, gradient boosted trees take second place and the decision tree is last.
The next chart (Figure 8) compares the algorithms by the MSE metric, defined as the mean (average) of the squares of the differences between the actual and estimated values. The random forest algorithm has the smallest value of this metric.
The next chart (Figure 9) displays the results for explained variance (also called explained variation), which is used to measure the discrepancy between a model and the actual data. In other words, it is the part of the model's total variance that is explained by factors that are actually present and is not due to error variance.
Figure 6: Result comparison between the algorithms by the RMSE metric
Figure 7: Result comparison between the algorithms by the MAE metric
Figure 8: Result comparison between the algorithms by the MSE metric
Figure 9: Result comparison between the algorithms by the "Explained Variance" metric
A higher value of explained variance indicates a stronger strength of association and better predictions.
As a result of the practical experiment, the following facts were found:
• The decision tree algorithm is the worst regression algorithm here: it combines considerable training time with the biggest prediction errors.
• The random forest algorithm is the most accurate, and its training time is small.
• The gradient boosted trees algorithm does not give the expected results.
There are the following recommendations for improving the decision tree and gradient boosting regression algorithms to make the models more accurate and performance-balanced:
• The decision tree model may be overfitted, which is often detrimental to the model's performance when new data is introduced. If no limit is set on a decision tree, it will give a zero MSE value on the training set, because in the worst case it ends up making one leaf per observation. Thus, preventing overfitting is of major importance when training a decision tree, and it can be done in two ways: setting constraints on tree size (fine-tuning hyperparameters) and tree pruning.
• There are two types of parameters to tune in the gradient boosting algorithm: tree-based and boosting parameters. There are no optimum values for the learning rate, as low values always work better, given that training uses a sufficient number of trees. GBM is robust enough not to overfit with an increasing number of trees, but a high number for a particular learning rate can still lead to overfitting. However, as the learning rate is reduced and the number of trees increased, the computation becomes expensive and takes a long time on standard personal computers. Keeping all this in mind, the following steps can be taken to optimize the model (a sketch follows the list):
1. Choose a relatively high learning rate. Generally, the default value of 0.1 works, but somewhere between 0.05 and 0.2 should work for different problems;
2. Determine the optimum number of trees for this learning rate, which should be around 40-70. Remember to choose a value for which the system works fairly fast, because it will be used for testing various scenarios and determining the tree parameters;
3. Tune tree-specific parameters for the chosen learning rate and number of trees;
4. Lower the learning rate and increase the number of estimators proportionally to get more robust models.
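A sketch of this recipe using Spark ML's GBTRegressor is shown below. The concrete values follow steps 1-3 above and are illustrative starting points, not the configuration used in the experiment.

```java
import org.apache.spark.ml.regression.GBTRegressor;

public final class GbtTuning {

    // Step 1: a relatively high learning rate (stepSize 0.1).
    // Step 2: an iteration count in the suggested 40-70 range.
    // Step 3: tree-specific parameters for that rate and tree count.
    public static GBTRegressor baseline() {
        return new GBTRegressor()
                .setFeaturesCol("features")
                .setLabelCol("label")
                .setStepSize(0.1)
                .setMaxIter(50)
                .setMaxDepth(5)
                .setSubsamplingRate(0.8);
    }

    // Step 4: lower the learning rate and raise the number of trees
    // proportionally to obtain a more robust model.
    public static GBTRegressor robust() {
        return baseline()
                .setStepSize(0.05)
                .setMaxIter(100);
    }
}
```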
7. Conclusions
The experiment for determining the factors that affect the crypto price was set up in this work. The related works give an understanding of what has already been done within this theme, and the conclusion was drawn that the reviewed works are not exhaustive. An own direction of investigation was proposed. It assumes relationships between the following groups of metrics: crypto data and worldwide metrics. The assumption that these relationships exist is based on dependencies in the real world, that is, dependencies between software and the way this software is used in real life, which in turn affects software usage.
The next step was determining the best-suited family of regression algorithms, and the non-linear family was selected. The metrics required for validation were selected as well. The experiment execution was the next step: Apache Spark was used to create the dataset from the source files and to create and train the regression models. The JMH tool determined that the random forest algorithm is the fastest in terms of training time. The validation of the created models shows that the random forest algorithm is also the most accurate, and its training time is the smallest in comparison to the other algorithms. The gradient boosted trees algorithm stays in the middle by performance, and the decision tree algorithm does not suit the prepared data and problem.
Further investigations may focus on including additional cryptocurrency factors in the model: the spread of crypto between exchanges, and more financial factors for the worldwide economic and political situation, such as the openness of a country's financial institutions, which describes its readiness for economic development, and infrastructure failures.

8. References
[1] U. Mukhopadhyay, A. Skjellum, O. Hambolu, J. Oakley, L. Yu and R. Brooks, "A brief survey of Cryptocurrency systems," 2016 14th Annual Conference on Privacy, Security and Trust (PST), 2016, pp. 745-752. doi: 10.1109/PST.2016.7906988.
[2] F. Sabry, W. Labda, A. Erbad and Q. Malluhi, "Cryptocurrencies and Artificial Intelligence: Challenges and Opportunities," IEEE Access, vol. 8, pp. 175840-175858, 2020. doi: 10.1109/ACCESS.2020.3025211.
[3] J. Bonneau, A. Miller, J. Clark, A. Narayanan, J. A. Kroll and E. W. Felten, "SoK: Research Perspectives and Challenges for Bitcoin and Cryptocurrencies," 2015 IEEE Symposium on Security and Privacy, 2015, pp. 104-121. doi: 10.1109/SP.2015.14.
[4] F. Béres, I. A. Seres, A. A. Benczúr and M. Quintyne-Collins, "Blockchain is Watching You: Profiling and Deanonymizing Ethereum Users," 2021 IEEE International Conference on Decentralized Applications and Infrastructures (DAPPS), 2021, pp. 69-78. doi: 10.1109/DAPPS52256.2021.00013.
[5] Yu Chen, Xuecheng Ma, Cong Tang and Man Ho Au, "PGC: Pretty good decentralized confidential payment system with auditability," Cryptology ePrint Archive, Report 2019/319, 2019. URL: https://eprint.iacr.org/2019/319.
[6] Understanding The Different Types of Cryptocurrency. URL: https://www.sofi.com/learn/content/understanding-the-different-types-of-cryptocurrency.
[7] The 10 Most Popular Cryptocurrencies, and What You Should Know About Each Before You Invest. URL: https://time.com/nextadvisor/investing/cryptocurrency/types-of-cryptocurrency.
[8] P. Tasatanattakool and C. Techapanupreeda, "Blockchain: Challenges and applications," 2018 International Conference on Information Networking (ICOIN), 2018, pp. 473-475. doi: 10.1109/ICOIN.2018.8343163.
[9] A Guide to Cryptocurrency Fundamental Analysis. URL: https://academy.binance.com/en/articles/a-guide-to-cryptocurrency-fundamental-analysis.
[10] The 7 Key Factors Influencing Cryptocurrency Value. URL: https://www.makeuseof.com/factors-influencing-the-cryptocurrency-value/.
[11] S. Boshuis, T. Braam, A. Pedroza Marchena and S. Jansen, "The Effect of Generic Strategies on Software Ecosystem Health: The Case of Cryptocurrency Ecosystems," 2018 IEEE/ACM 1st International Workshop on Software Health (SoHeal), 2018, pp. 10-17.
[12] Jiangtao Ma, Yaqiong Qiao, Guangwu Hu, Yongzhong Huang, Arun Kumar Sangaiah, Chaoqin Zhang, et al., "De-anonymizing social networks with random forest classifier," IEEE Access, vol. 6, pp. 10139-10150, 2017.
[13] O. Vynokurova, D. Peleshko, P. Zhernova, I. Perova and A. Kovalenko, "Solving Fraud Detection Tasks Based on Wavelet-Neuro Autoencoder," in: S. Babichev, V. Lytvynenko, W. Wójcik, S. Vyshemyrskaya (eds.), Lecture Notes in Computational Intelligence and Decision Making, ISDMCI 2020, Advances in Intelligent Systems and Computing, vol. 1246, Springer, Cham, 2021. doi: 10.1007/978-3-030-54215-3_34.
[14] T. Radivilova, L. Kirichenko, D. Ageiev and V. Bulakh, "Classification Methods of Machine Learning to Detect DDoS Attacks," 2019 10th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), 2019, pp. 207-210. doi: 10.1109/IDAACS.2019.8924406.
[15] F. A. Cahyadi, A. I. Owen, F. Ricardo and A. A. S. Gunawan, "Blockchain Technology behind Cryptocurrency and Bitcoin for Commercial Transactions," 2021 1st International Conference on Computer Science and Artificial Intelligence (ICCSAI), 2021, pp. 115-119. doi: 10.1109/ICCSAI53272.2021.9609790.
[16] How Bitcoin Works. URL: https://www.investopedia.com/news/how-bitcoin-works.
[17] S. Pillai, D. Biyani, R. Motghare and D. Karia, "Price Prediction and Notification System for Cryptocurrency Share Market Trading," 2021 International Conference on Communication information and Computing Technology (ICCICT), 2021, pp. 1-7. doi: 10.1109/ICCICT50803.2021.9510122.
[18] X. Li and C. A. Wang, "The technology and economic determinants of cryptocurrency exchange rates: The case of bitcoin," Decision Support Systems, vol. 95, pp. 49-60, 2017.
[19] A. Park, J. Kietzmann, L. Pitt and A. Dabirian, "The Evolution of Nonfungible Tokens: Complexity and Novelty of NFT Use-Cases," IT Professional, vol. 24, no. 1, pp. 9-14, Jan.-Feb. 2022. doi: 10.1109/MITP.2021.3136055.
[20] K. Smelyakov, A. Chupryna, M. Hvozdiev and D. Sandrkin, "Gradational Correction Models Efficiency Analysis of Low-Light Digital Image," 2019 Open Conference of Electrical, Electronic and Information Sciences (eStream), 2019, pp. 1-6. doi: 10.1109/eStream.2019.8732174.
[21] K. Smelyakov, M. Shupyliuk, V. Martovytskyi, D. Tovchyrechko and O. Ponomarenko, "Efficiency of image convolution," 2019 IEEE 8th International Conference on Advanced Optoelectronics and Lasers (CAOL), 2019, pp. 578-583. doi: 10.1109/CAOL46282.2019.9019450.
[22] O. Lemeshko, M. Yevdokymenko, O. Yeremenko, A. M. Hailan, P. Segeč and J. Papán, "Design of the Fast ReRoute QoS Protection Scheme for Bandwidth and Probability of Packet Loss in Software-Defined WAN," 2019 IEEE 15th International Conference on the Experience of Designing and Application of CAD Systems (CADSM), 2019, pp. 1-5. doi: 10.1109/CADSM.2019.8779321.
[23] D. Ageyev and T. Radivilova, "Traffic monitoring and abnormality detection methods for decentralized distributed networks," CEUR Workshop Proceedings, vol. 2923, 2021, pp. 283-288.
[24] K. Smelyakov, A. Datsenko, V. Skrypka and A. Akhundov, "The Efficiency of Images Reduction Algorithms with Small-Sized and Linear Details," 2019 IEEE International Scientific-Practical Conference Problems of Infocommunications, Science and Technology (PIC S&T), 2019, pp. 745-750. doi: 10.1109/PICST47496.2019.9061250.
[25] A Novel Cryptocurrency Price Prediction Model Using GRU, LSTM and bi-LSTM Machine Learning Algorithms. URL: https://www.mdpi.com/2673-2688/2/4/30/pdf.
[26] Predictions of bitcoin prices through machine learning based frameworks. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8022579.
[27] Y. Xin et al., "Machine Learning and Deep Learning Methods for Cybersecurity," IEEE Access, vol. 6, pp. 35365-35381, 2018. doi: 10.1109/ACCESS.2018.2836950.
[28] Data Repository. URL: https://drive.google.com/drive/folders/17wcLX2VVw1cCo_6RsCEas2HSiDnZhfj4?usp=sharing.
[29] Data for Analysis. URL: https://www.marketwatch.com/investing/index/spg1200/download-data?countrycode=xx.
[30] Five Types of Regression Analysis And When To Use Them. URL: https://www.appier.com/blog/5-types-of-regression-analysis-and-when-to-use-them.
[31] Eight popular regression algorithms in machine learning of 2021. URL: https://www.jigsawacademy.com/popular-regression-algorithms-ml.
[32] DTR. URL: https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html.
[33] The flowchart of random forest (RF) for regression. URL: https://www.researchgate.net/figure/The-flowchart-of-random-forest-RF-for-regression-adapted-from-Rodriguez-Galiano-et_fig3_303835073.
[34] The Gradient Boosted Regression Trees (GBRT) model. URL: https://apple.github.io/turicreate/docs/userguide/supervised-learning/boosted_trees_regression.html.