=Paper=
{{Paper
|id=Vol-2147/p09
|storemode=property
|title=Algorithmic Trading and Machine Learning Based on GPU
|pdfUrl=https://ceur-ws.org/Vol-2147/p09.pdf
|volume=Vol-2147
|authors=Mantas Vaitonis,Saulius Masteika,Konstantinas Korovkinas
|dblpUrl=https://dblp.org/rec/conf/system/VaitonisMK18
}}
==Algorithmic Trading and Machine Learning Based on GPU==
Mantas Vaitonis, Saulius Masteika, Konstantinas Korovkinas
Vilnius University, Kaunas Faculty, Muitinės str. 8, LT-44280 Kaunas, Lithuania
mantas.vaitonis@knf.vu.lt, saulius.masteika@knf.vu.lt, konstantinas.korovkinas@knf.vu.lt

Abstract— This paper investigates the speed improvements available when using a graphics processing unit (GPU) for algorithmic trading and machine learning. A modern GPU allows hundreds of operations to be performed in parallel, leaving the CPU free to execute other jobs. Several issues related to implementing algorithmic trading and machine learning on GPU are discussed, including limited programming flexibility as well as the effect that proper memory layout can have on speed increases when using GPU devices. Empirical research on algorithmic trading on GPU is presented, which showed the advantage of the GPU over a CPU-only system. Moreover, machine learning methods on GPU are reviewed, and the findings of this paper may be applied in future work.

Keywords— high frequency trading; machine learning; GPU; high performance computing; genetic programming.

I. INTRODUCTION

Nowadays standard computers come with sequential CPUs or with multicore CPUs, which allow a limited number of processes to be executed in parallel. What is important here is that the GPU hardware such computers also include is strongly parallel and may operate independently of the main CPU. A modern GPU allows hundreds of operations to be performed in parallel, leaving the CPU free to execute other jobs. In particular, GPUs offer hundreds of processing cores, but they can be used simultaneously only to perform data parallel computations. Moreover, GPUs usually have no direct access to the main memory and they do not offer hardware managed caches; two aspects that make memory management a critical factor to be carefully considered [1].

GPU architectures are specialized for compute-intensive, memory-intensive, highly parallel computation, and are therefore designed such that more resources are devoted to data processing than to caching or control flow. State of the art GPUs provide up to an order of magnitude more peak IEEE single-precision floating-point performance than their CPU counterparts. Additionally, GPUs have much more aggressive memory subsystems, typically endowed with more than 10x higher memory bandwidth than a CPU. Peak performance is usually impossible to achieve on general purpose applications, yet capturing even a fraction of peak performance yields significant speedup. GPU performance is dependent on finding high degrees of parallelism: a typical computation running on the GPU must express thousands of threads in order to effectively use the hardware capabilities. Algorithms for machine learning applications will need to consider such parallelism in order to utilize many-core processors. Applications which do not express parallelism will not continue improving their performance when run on newer computing platforms at the rates we have enjoyed in the past. Therefore, finding large scale parallelism is important for compute performance in the future. Programming for GPUs is then indicative of the future many-core programming experience [2].

When searching for “GPU back-testing software” almost no results appear. The technology is very difficult to use and to apply to general back-testing.

The problem lies in the way in which a GPU works and the way in which general purpose back-testing works. Most back-testing programs have a language like MQL4 or NinjaScript. These languages are used to construct trading systems that the simulator executes by performing some sort of parsing of the scripted code. This approach gives flexibility because researchers can code whichever strategy they can think of, with whatever logic, and the simulator will be able to handle it. The strategy coded is in essence a function that the simulator then uses to execute code within its back-testing engine. However, when trying to move this type of thinking to the GPU, researchers run into many problems [3].

The work reported in this paper aims to present a literature review of GPU benefits for algorithmic trading and machine learning. An overview of the uses of machine learning and algorithmic trading on GPU is presented. Both topics are presented separately, and the results will be used for future work on machine learning with high frequency trading on GPU. The paper also presents high frequency algorithmic trading results when applied on CPU and GPU.

The rest of the paper is organized as follows: the theory and problem statement are presented in Sections I and II. Sections III, IV, V and VI give an overview of GPU for hardware acceleration, high frequency trading, GPU in high performance computing and GPU in machine learning, together with the results and the summary of the research, followed by conclusions in Section VII.
II. OBSTACLES USING GPU

The GPU is a very limited machine in terms of programming flexibility. It is not possible just to code the system within a script and send it to a GPU back-tester. If researchers want the GPU to perform a trading system simulation, they will need to code the entire system and simulator within the same function and have the GPU run that in a batch process.

Introducing things like double loops and random access patterns is hard for the GPU. When writing simulations for a GPU it is necessary to ensure that everything that is random-access intensive or conditional intensive is pre-calculated and passed to the GPU (a minimal sketch of this pre-calculation approach is given at the end of this section). Therefore, for something that is “general purpose” it becomes very hard to pre-calculate and interactively build the entire simulator-plus-system code, load it into the GPU and perform the simulations. There are many ways in which GPU technology is currently being used in trading. Traditionally GPUs have been used to execute simulations that are very specific and parallelizable – such as pricing simulations, machine learning training and high frequency trading algorithms.

When looking for something very general, the GPU tends to be a hard solution. However, if one is interested in some particular trading problem, then there is a big chance that researchers will be able to benefit from the GPU if they are willing to spend the time, energy and money necessary to create a custom GPU implementation [3][4].
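As a concrete illustration of this pre-calculation idea, the following minimal CUDA sketch (our own illustrative example, not code from the papers cited above) assumes the branch-heavy entry/exit logic has already been evaluated on the CPU and stored as +1/-1/0 position signals; the GPU kernel is then left with purely data-parallel arithmetic, replaying one scenario per thread in a single batch. All names, sizes and the profit rule are assumptions made for the example.

<pre>
// Minimal CUDA sketch: conditional-heavy signal logic is pre-computed on the CPU,
// the GPU only performs regular, data-parallel profit arithmetic in a batch.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void replaySignals(const float *prices, const signed char *signals,
                              int nTicks, int nScenarios, float *profit)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;      // one scenario per thread
    if (s >= nScenarios) return;
    const signed char *sig = signals + (size_t)s * nTicks;
    float pnl = 0.0f;
    for (int t = 1; t < nTicks; ++t)                    // position held over [t-1, t)
        pnl += sig[t - 1] * (prices[t] - prices[t - 1]);
    profit[s] = pnl;
}

int main()
{
    const int nTicks = 10000, nScenarios = 4096;
    std::vector<float> hPrices(nTicks);
    std::vector<signed char> hSignals((size_t)nScenarios * nTicks, 1); // dummy: always long
    for (int t = 0; t < nTicks; ++t) hPrices[t] = 100.0f + 0.01f * t;  // dummy price path

    float *dPrices, *dProfit; signed char *dSignals;
    cudaMalloc(&dPrices, nTicks * sizeof(float));
    cudaMalloc(&dSignals, (size_t)nScenarios * nTicks);
    cudaMalloc(&dProfit, nScenarios * sizeof(float));
    cudaMemcpy(dPrices, hPrices.data(), nTicks * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dSignals, hSignals.data(), (size_t)nScenarios * nTicks, cudaMemcpyHostToDevice);

    int block = 256, grid = (nScenarios + block - 1) / block;
    replaySignals<<<grid, block>>>(dPrices, dSignals, nTicks, nScenarios, dProfit);

    std::vector<float> hProfit(nScenarios);
    cudaMemcpy(hProfit.data(), dProfit, nScenarios * sizeof(float), cudaMemcpyDeviceToHost);
    printf("profit of scenario 0: %f\n", hProfit[0]);
    cudaFree(dPrices); cudaFree(dSignals); cudaFree(dProfit);
    return 0;
}
</pre>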
III. GPU FOR HARDWARE ACCELERATION

Hardware acceleration is achieved by utilizing specific hardware to gain higher computational performance than that provided by a general purpose CPU. Devices intended for intense calculations include the Field-Programmable Gate Array (FPGA), IBM’s Cell Broadband Engine Architecture (Cell BE or, simply, Cell) and Graphics Processing Units (GPUs). Until recently the GPU remained on the fringes of HPC (high performance computing), mostly because of the high learning curve caused by the fact that low-level graphics languages were the only way to program GPUs. However, NVIDIA has since come out with a new line of graphics cards – Tesla [4].

One of the main features of NVIDIA GPUs is the ease of programmability made possible with CUDA – the Compute Unified Device Architecture. With a low learning curve, CUDA allows developers to tap into the enormous computing power of GPUs, yielding high performance benefits [5]. As mentioned in the introduction, we use CUDA, which allows for the implementation of algorithms using MATLAB with CUDA-specific extensions [5]. When a program using CUDA extensions and running on the CPU invokes a GPU kernel, many copies of this kernel – known as threads – are enumerated and distributed to the available multiprocessors, where their execution starts [4].
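The kernel/thread model described above can be made concrete with a small, self-contained CUDA example (an illustrative sketch, not part of the authors' MATLAB implementation): one kernel launch enumerates roughly a million threads, each of which identifies its own quote from its block and thread indices and computes a mid-price. Names and data are assumptions made for the example.

<pre>
// One launch spawns thousands of threads; each handles one quote.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void midPrice(const float *bid, const float *ask, float *mid, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        mid[i] = 0.5f * (bid[i] + ask[i]);
}

int main()
{
    const int n = 1 << 20;                           // ~1 million quotes
    std::vector<float> hBid(n, 100.0f), hAsk(n, 100.02f), hMid(n);

    float *dBid, *dAsk, *dMid;
    cudaMalloc(&dBid, n * sizeof(float));
    cudaMalloc(&dAsk, n * sizeof(float));
    cudaMalloc(&dMid, n * sizeof(float));
    cudaMemcpy(dBid, hBid.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dAsk, hAsk.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // 4096 blocks of 256 threads; the hardware scheduler distributes the blocks
    // over the available multiprocessors, as described in the text above.
    int block = 256, grid = (n + block - 1) / block;
    midPrice<<<grid, block>>>(dBid, dAsk, dMid, n);
    cudaMemcpy(hMid.data(), dMid, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("mid[0] = %.4f\n", hMid[0]);
    cudaFree(dBid); cudaFree(dAsk); cudaFree(dMid);
    return 0;
}
</pre>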
The two main criteria for algorithmic trading are speed – that is, the speed with which the same set of computations can be performed on multiple sets of data – and programmability. For this purpose, general-purpose hardware – such as an Intel Central Processing Unit (CPU) – is not well suited. The CPU is designed to execute commands in a linear fashion; however, the task at hand benefits most from parallelization, as the same calculations are required to be performed on multiple data. This is where parallelization and hardware acceleration come into play.

IV. HIGH FREQUENCY TRADING

The developments in computer technology have changed the way financial instruments are traded. A significant part of trades is handled without human intervention, with trading algorithms making the trading decisions. Although the concept of algorithmic trading is not brand new, the speed at which algorithmic trading operates has grown tremendously over the past ten years.

Trade execution time has shrunk from daily trading to microseconds and even nanoseconds. Due to the increase in speed, a huge number of orders and order cancellations are required. Profit chances for high frequency traders are very time-sensitive, and low latency for trade execution is of the main importance. Thus, HFT firms invest in high-speed connections and place their trading platforms close to the stock market servers via co-location [6].

Nowadays, financial markets are fully automated, consisting of algorithmic trading; thus, they are largely dominated by high frequency trading. High frequency trading platforms have replaced the traditional auction-like floor where traders compete on price [7]. The main focus of HFT is to beat time. The algorithm waits until a trader buys a certain amount of any financial instrument at any given time; the high frequency traders then use this information to change the price they quote in the market [8][9][10][11]. The economics and finance academic community considers HFT beneficial to the market because HFT provides liquidity and, therefore, facilitates the flow of commerce in the capital markets [11].

Given the fact that high frequency trading has to be done in milliseconds or even nanoseconds, all trading must be performed using a supercomputer. In real life, depending on the trade, trading opportunities can last from nanoseconds to minutes or even hours.

Trading strategies used by high frequency traders seek to exploit short-lived trading opportunities in the markets that would not be possible to find or identify in any other way than with the high-speed processing power of computers. These trading opportunities are very small abnormalities in the pricing of financial instruments that result in extremely low profit per trade. High frequency trading earns higher profit because it is possible to trade in big volumes; thus, profit can be generated from these small changes in prices. One of the advantages of HFT is that it provides liquidity and helps to ensure the efficiency of prices for financial assets [12].

V. GPU IN HIGH PERFORMANCE COMPUTING

High-frequency trading (HFT) is a specialized form of algorithmic trading, where the execution of computerized trading strategies is characterized by extremely short position-holding periods – just a few seconds or even down to milliseconds. The success of an HFT algorithm depends on its ability to react to a situation faster than others. This has given birth to another variant of HFT called Ultra High Frequency Trading (UHFT). Here, the execution of trades happens in sub-millisecond times. The technology used by UHFT traders includes co-location of servers with the exchange, direct market access, parallel processing on GPUs and special hardware like FPGAs [13].

The Consolidated Tape Association (CTA) oversees the collection, processing and dissemination of consolidated quote and trade data at NYSE. The Securities Information Processor (SIP) is the technology that enables collecting quote and trade data from the exchanges, consolidating it, and sending it out as a continuous stream of best bids and offers (quotes) and last sales (trades). The SIP has to work at enormous speed. On average, NYSE handles about 200,000 (2 lakh) quotes per second, out of which 28,000 per second get converted into trades. The traders talk to the exchanges using the FIX protocol; FIX stands for Financial Information eXchange. The standard is managed by a nonprofit organization called the FIX Trading Community. The messages consist of ASCII characters, and there is also an XML-based representation of the standard, called FIXML. Recently Citibank announced that it will provide FIX functionality to NSE in India [13].

There is increasing use of High Performance Computing platforms like GPU multiprocessing and FPGAs in HFT algorithms. These algorithms are fast and parallelizable, and they are specifically designed to make money by exploiting tiny, lightning-fast price changes in shares [13][14].
A. GPU in high frequency trading

During our research, an algorithmic pair trading strategy [8] was run on a CPU, an Intel i5-3230M at 2,6 GHz with two cores (2 MATLAB workers), and on a GPU, a GeForce 710M with 96 CUDA cores. First we applied the pair trading strategy only on the CPU and then on the CPU working together with the GPU.

The nanosecond data used for the experiment was provided by the Nanotick company. The futures contracts were from CME Group, which consists of NYMEX, COMEX and CBOT. Nanotick provided five different commodity futures contracts: NG (natural gas), BZ (Brent crude oil), CL (crude oil), HO (NY Harbor ULSD) and RB (RBOB Gasoline). The time period of the commodity futures contracts was from 01-08-2015 to 31-08-2015.

During the research, pair detection, detection of buy/sell signals, trading and profit calculation were parallelized when implemented on CPU and GPU [8]. When these functions were parallelized it was no longer necessary to wait for one function to stop before starting the other one; multiple calculations with multiple functions became possible. An illustrative sketch of one such parallelized step is given at the end of this subsection.

The research aim was not to measure the profit of the strategy but to improve the speed of the algorithm by using the GPU. The same pair trading strategy was applied to the CPU and later to the CPU working together with the GPU. The table below shows the number of records the pairs trading algorithm had to process and how much time it took using the CPU and the GPU.

TABLE I. CPU AND GPU COMPARISON

Date | Intel i5-3230M 2,6 GHz, 2 cores (in seconds) | GeForce 710M, 96 CUDA cores (in seconds) | Number of records processed
2015-08-03 till 2015-08-31 | 74777,4 | 58378,53 | 124789970

Table I shows the trading time of the algorithm using different hardware: the CPU (Intel i5-3230M 2,6 GHz, 2 cores) and the GPU (GeForce 710M, 96 CUDA cores). The total number of records processed was 124789970 for each simulation. Over the whole month the CPU-only run took 74777,4 s versus 58378,53 s with the GPU, i.e. roughly a 22% reduction in total simulation time. More detailed information is presented in the figure below, where the speedup difference per day is shown.

Fig. 1. Comparison of CPU and GPU using HFT in seconds

As shown in the figure above, the simulation speed of the pair trading algorithm improved by 12% to 36% when run on the GPU instead of just the CPU. The difference in speedup between days occurs due to the different number of trades made and the different number of trade signals. The more parameters it is possible to parallelize and move to the GPU, the bigger the speedup that can be achieved. During this experiment, the bigger the matrix of trades and pairs used, the more measurable was the speedup obtained by the GPU. The results show the importance of technical advantages in HFT and how important it is to tune the algorithm in order to make the most of the hardware it runs on.
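To illustrate the kind of data-parallel work that can be moved to the GPU, the following CUDA sketch (an illustrative reconstruction under simplifying assumptions, not the MATLAB code used in the experiment) detects entry signals for many instrument pairs at once: each thread owns one pair, computes the mean and standard deviation of its price spread and counts the ticks on which the normalized spread crosses an entry threshold. The threshold, data layout and dummy prices are assumptions.

<pre>
// Buy/sell-signal detection for many pairs in parallel: one pair per thread.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void countPairSignals(const float *pricesA, const float *pricesB,
                                 int nTicks, int nPairs, float entryZ, int *signals)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;       // one pair per thread
    if (p >= nPairs) return;

    const float *a = pricesA + (size_t)p * nTicks;
    const float *b = pricesB + (size_t)p * nTicks;

    float sum = 0.0f, sumSq = 0.0f;
    for (int t = 0; t < nTicks; ++t) {                   // spread statistics
        float s = a[t] - b[t];
        sum += s; sumSq += s * s;
    }
    float mean = sum / nTicks;
    float var  = sumSq / nTicks - mean * mean;
    float sd   = sqrtf(fmaxf(var, 1e-12f));

    int count = 0;
    for (int t = 0; t < nTicks; ++t)                     // entry signal: |z| > entryZ
        count += fabsf((a[t] - b[t] - mean) / sd) > entryZ;
    signals[p] = count;
}

int main()
{
    const int nTicks = 5000, nPairs = 2048;
    std::vector<float> hA((size_t)nPairs * nTicks), hB((size_t)nPairs * nTicks);
    for (size_t i = 0; i < hA.size(); ++i) {             // dummy, slightly diverging paths
        hA[i] = 50.0f + 0.0010f * (i % nTicks);
        hB[i] = 50.0f + 0.0008f * (i % nTicks);
    }

    float *dA, *dB; int *dSig;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dB, hB.size() * sizeof(float));
    cudaMalloc(&dSig, nPairs * sizeof(int));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256, grid = (nPairs + block - 1) / block;
    countPairSignals<<<grid, block>>>(dA, dB, nTicks, nPairs, 2.0f, dSig);

    std::vector<int> hSig(nPairs);
    cudaMemcpy(hSig.data(), dSig, nPairs * sizeof(int), cudaMemcpyDeviceToHost);
    printf("pair 0 produced %d entry signals\n", hSig[0]);
    cudaFree(dA); cudaFree(dB); cudaFree(dSig);
    return 0;
}
</pre>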
B. Stock trading using genetic programming on GPU

D. McKenney and T. White [14] presented their research on stock trading using genetic programming on GPU. Within this work, genetic programming (GP) was used in an attempt to solve the real-world problem of stock trading strategy generation. A GPU device was used to evaluate individuals within the GP population through stack-based interpretation (due to the lack of recursion support on many GPU devices). With a small amount of memory access optimization, a speedup factor of over 600 was reached when compared to a sequential evaluation of the same data running on a 2.4 GHz CPU. The effect of increasing the size of the training set (through the addition of more stocks and longer training periods) was also investigated. It was found that using small training sets resulted in the worst testing results, while the best test results were obtained when using the largest training sets. These results supported the hypothesis that analyzing more stocks over a longer period of time can generate a more general and effective stock trading strategy. The speedup gained by using GPU devices for evaluation enabled this large training set to be evaluated quickly, whereas a sequential implementation would make this approach unfeasible. Finally, several areas of improvement for both GP on GPU and stock trading strategy creation using GP were identified. Continuing work and addressing these possible areas of improvement may result in faster evaluation of individuals, as well as a much more profitable trading solution [14].
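The stack-based interpretation mentioned above can be sketched in CUDA as follows. This is an illustrative example in the spirit of [14], not the authors' implementation: each GP tree is flattened into a postfix program (since device code cannot rely on recursion), and each thread evaluates one individual with a small explicit stack. The opcode set, program length and inputs are assumptions made for the example.

<pre>
// Stack-based evaluation of a GP population on the GPU: one individual per thread.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

enum Op { PUSH_CONST, PUSH_INPUT, ADD, SUB, MUL };
const int PROG_LEN = 5, STACK_MAX = 8, N_INPUTS = 2;

__global__ void evalPopulation(const int *ops, const float *args, const float *inputs,
                               int nIndividuals, float *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;       // one individual per thread
    if (i >= nIndividuals) return;

    float stack[STACK_MAX];
    int sp = 0;
    for (int pc = 0; pc < PROG_LEN; ++pc) {              // walk the postfix program
        int op  = ops[i * PROG_LEN + pc];
        float a = args[i * PROG_LEN + pc];
        switch (op) {
            case PUSH_CONST: stack[sp++] = a; break;
            case PUSH_INPUT: stack[sp++] = inputs[(int)a]; break;
            case ADD: sp--; stack[sp - 1] += stack[sp]; break;
            case SUB: sp--; stack[sp - 1] -= stack[sp]; break;
            case MUL: sp--; stack[sp - 1] *= stack[sp]; break;
        }
    }
    result[i] = stack[0];                                // value left on the stack
}

int main()
{
    const int nIndividuals = 1024;
    // Every individual here encodes (input0 - input1) * 0.5 in postfix form.
    std::vector<int>   hOps(nIndividuals * PROG_LEN);
    std::vector<float> hArgs(nIndividuals * PROG_LEN);
    for (int i = 0; i < nIndividuals; ++i) {
        int   o[PROG_LEN] = { PUSH_INPUT, PUSH_INPUT, SUB, PUSH_CONST, MUL };
        float a[PROG_LEN] = { 0.0f,       1.0f,       0.f, 0.5f,       0.f };
        for (int j = 0; j < PROG_LEN; ++j) { hOps[i*PROG_LEN+j] = o[j]; hArgs[i*PROG_LEN+j] = a[j]; }
    }
    std::vector<float> hInputs = { 104.0f, 100.0f };      // e.g. two indicator values

    int *dOps; float *dArgs, *dInputs, *dRes;
    cudaMalloc(&dOps, hOps.size() * sizeof(int));
    cudaMalloc(&dArgs, hArgs.size() * sizeof(float));
    cudaMalloc(&dInputs, N_INPUTS * sizeof(float));
    cudaMalloc(&dRes, nIndividuals * sizeof(float));
    cudaMemcpy(dOps, hOps.data(), hOps.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dArgs, hArgs.data(), hArgs.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dInputs, hInputs.data(), N_INPUTS * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256, grid = (nIndividuals + block - 1) / block;
    evalPopulation<<<grid, block>>>(dOps, dArgs, dInputs, nIndividuals, dRes);

    std::vector<float> hRes(nIndividuals);
    cudaMemcpy(hRes.data(), dRes, nIndividuals * sizeof(float), cudaMemcpyDeviceToHost);
    printf("individual 0 evaluates to %.2f (expected 2.00)\n", hRes[0]);
    cudaFree(dOps); cudaFree(dArgs); cudaFree(dInputs); cudaFree(dRes);
    return 0;
}
</pre>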
VI. GPU IN MACHINE LEARNING

The use of GPUs in machine learning has become widespread in recent years. One of the most promising machine learning algorithms is the SVM, which can be conveniently adapted to parallel architectures. During the last decade, many works have been devoted to accelerating the time-consuming training phase of SVMs on many-core GPUs. Catanzaro et al. in [2] first proposed GPUSVM for the binary classification problem and achieved a speedup of 9-35x over LIBSVM running on a traditional processor. Later, Herrero-Lopez et al. in [18] improved Catanzaro's work by adding support for multiclass classification. They achieved speedups in the range of 3-57x for training and 3-112x for classification. Carpenter in [19] presented cuSVM, a software package for high-speed Support Vector Machine (SVM) training and prediction that exploits the massively parallel processing power of Graphics Processors (GPUs). Other authors in papers [15][17][23] also reported that GPU optimization of SVM achieves better performance compared with CPU. Vaněk et al. in [20] introduced a novel GPU approach to support vector machine training: the Optimized Hierarchical Decomposition SVM (OHD-SVM). It uses a hierarchical decomposition iterative algorithm that allows matrix-matrix multiplication to be used to calculate the kernel matrix values. They declared that the algorithm is significantly faster than all other implementations on all datasets. The biggest difference was on the largest datasets, where they achieved a speed-up of up to 12 times in comparison with the fastest previously published GPU implementation.
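The appeal of casting kernel matrix computation as matrix-matrix multiplication can be shown with a short cuBLAS sketch (a generic illustration of the idea, not the OHD-SVM code): for a linear kernel, the Gram matrix K with K[i][j] = <x_i, x_j> of an entire batch of samples is a single GEMM call executed on the GPU. Matrix sizes and data are dummy assumptions; compile with -lcublas.

<pre>
// Linear kernel (Gram) matrix of n samples with d features via one cuBLAS GEMM.
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cstdio>
#include <vector>

int main()
{
    const int n = 1024;   // number of training samples
    const int d = 64;     // number of features

    // Row-major n x d sample matrix with dummy values.
    std::vector<float> hX((size_t)n * d);
    for (size_t i = 0; i < hX.size(); ++i) hX[i] = 0.001f * (float)(i % 97);

    float *dX, *dK;
    cudaMalloc(&dX, (size_t)n * d * sizeof(float));
    cudaMalloc(&dK, (size_t)n * n * sizeof(float));
    cudaMemcpy(dX, hX.data(), (size_t)n * d * sizeof(float), cudaMemcpyHostToDevice);

    // Row-major X (n x d) is column-major d x n with one sample per column,
    // so K = X_col^T * X_col yields the n x n Gram (linear kernel) matrix.
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                n, n, d,
                &alpha, dX, d, dX, d,
                &beta,  dK, n);

    std::vector<float> hK((size_t)n * n);
    cudaMemcpy(hK.data(), dK, (size_t)n * n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("K[0][0] = %f (squared norm of sample 0)\n", hK[0]);

    cublasDestroy(handle);
    cudaFree(dX); cudaFree(dK);
    return 0;
}
</pre>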
Another challenging research area is Deep Learning, whose algorithms largely involve simple matrix manipulations and are therefore well suited to implementation on graphics processors. Raina et al. in [21] developed general principles for massively parallelizing unsupervised learning tasks using graphics processors and showed that these principles can be applied to successfully scale up learning algorithms for both deep belief networks (DBNs) and sparse coding. Their implementation of DBN learning is up to 70 times faster than a dual-core CPU implementation for large models. Dean et al. in [22] showed that training large deep learning models with billions of parameters using 16000 CPU cores could dramatically improve training performance. Krizhevsky et al. in [29] showed that training a large deep convolutional network with 60 million parameters and 650,000 neurons on a large data set achieved great performance on GPU processors [16]. Coates et al. in [24] presented their own system based on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with Infiniband interconnects and MPI. Their system is able to train networks with 1 billion parameters on just 3 machines in a couple of days, and they showed that it can comfortably scale to networks with over 11 billion parameters using just 16 machines – more than 6.5 times as large as the network reported in [22] (the largest previous network), while using fewer than 2% as many machines. Chen et al. in [25] implemented a variant of the deep belief network (DBN), called folded-DBN, on NVIDIA's Tesla K20 GPU. The results showed that, when comparing the execution time of the fine-tuning process, the GPU implementation yields a 7 to 11 times speedup over the CPU platform.

Other authors also confirmed in their research that the proposed models achieved better results on GPU. Hung and Wang in [26] proposed a GPU-accelerated PSO (GPSO) algorithm that uses the NVIDIA Tesla C1060 GPU to improve the timing efficiency of particle swarm optimization (PSO). Numerical results showed that the GPU architecture fits the PSO framework well by reducing computational time, achieving high parallel efficiency and finding better optimal solutions through the use of a large number of particles. Cai et al. in [27] proposed an approach to forecast large scale conditional volatility and covariance using neural networks on GPU. Tran and Cambria in [28] developed an ensemble application of the extreme learning machine (ELM) and GPU for real-time multimodal sentiment analysis that leverages the power of sentic memes (basic inputs of sentiments that can generate most human emotions). Their proposed multimodal system is shown to achieve an accuracy of 78%. In terms of processing speed, their method shows improvements of several orders of magnitude for feature extraction compared to CPU-based counterparts.

VII. CONCLUSIONS

In this article we have presented both the opportunities and challenges of the algorithmic trading and machine learning approach on GPU. An empirical study of algorithmic trading on GPU was presented, which showed the advantage of the GPU over the CPU.

High frequency trading combined with machine learning is a new and growing phenomenon. It provides interesting research opportunities in financial management, market dynamics, FPGA hardware and parallel computing on platforms like CUDA.

A review of works in the area of machine learning based on GPU is also presented in this paper; it leads to the conclusion that this technique is very promising for classification and forecasting tasks and could be used in big data areas. Systems implemented on GPU are able to process a huge volume of parameters faster than on CPU. The findings of this paper may be applied in future work.

ACKNOWLEDGMENT

We would like to show our gratitude to NANOTICK for providing high frequency data in microseconds for 5 commodity futures contracts.
REFERENCES

[1] Margara A., Cugola G. (2011), "High performance content-based matching using GPUs". In Proceedings of the 5th ACM International Conference on Distributed Event-Based Systems, New York, USA.
[2] Catanzaro B., Sundaram N., Keutzer K. (2008), "Fast support vector machine training and classification on graphics processors". In Proceedings of the 25th International Conference on Machine Learning, pp. 104-111. ACM.
[3] MechanicalForex (2016), mechanicalforex.com. [ONLINE] Available at: http://mechanicalforex.com/2016/02/trading-and-the-gpu-wasted-power.html. [Accessed 12 January 2018].
[4] Preis T. (2011), "GPU-computing in econophysics and statistical physics". The European Physical Journal Special Topics, Vol. 194, pp. 87-119.
[5] NVIDIA Corporation (2008), NVIDIA CUDA Compute Unified Device Architecture.
[6] Kaya O. (2016), "High-frequency trading. Reaching the limits". Automated Trader Magazine, Vol. 41, pp. 23-27.
[7] Fox M. B., Glosten L. R., Rauterberg G. V. (2015), "The New Stock Market: Sense and Nonsense". 65 Duke L.J. 191.
[8] Herlemont D. (2013), "Pairs Trading, Convergence Trading, Cointegration". Quantitative Finance, Vol. 12(9).
[9] Zubulake P., Lee S. (2011), "The High Frequency Game Changer: How Automated Trading Strategies Have Revolutionized the Markets". Aite Group, Wiley Trading.
[10] Brogaard J., Hendershott J. T., Riordan R. (2013), "High frequency trading and price discovery". ECB Lamfalussy Fellowship Programme / Working Paper Series, No 1602, European Central Bank Press.
[11] Jaramillo C. (2016), "The Revolt against High-Frequency Trading: From Flash Boys, to Class Actions, to IEX". Review of Banking & Financial Law, Vol. 35, pp. 483-499.
[12] Kirchner S. (2015), "High frequency trading: Fact and fiction". Policy: A Journal of Public Policy and Ideas, Vol. 31(4), pp. 8-20.
[13] Limaye S. S. (2014), "Electronically aided high frequency trading". International Journal of Engineering Research and Applications, pp. 14-18.
[14] McKenney D., White T. (2012), "Stock Trading Strategy Creation Using GP on GPU". Soft Computing, Vol. 16(2), pp. 247-259.
[15] Salleh N. S. M., Baharim M. F. (2015), "Performance Comparison of Parallel Execution Using GPU and CPU in SVM Training Session". In Advanced Computer Science Applications and Technologies (ACSAT), 2015 4th International Conference on, pp. 214-217.
[16] Li X., Li K., Zhang G., Zheng W. (2015), "Deep Learning and Its Parallelization: Concepts and Instances".
[17] Sopyła K., Drozda P., Górecki P. (2012), "SVM with CUDA accelerated kernels for big sparse problems". In International Conference on Artificial Intelligence and Soft Computing, pp. 439-447.
[18] Herrero-Lopez S., Williams J. R., Sanchez A. (2010), "Parallel multiclass classification using SVMs on GPUs". In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pp. 2-11.
[19] Carpenter A. (2009), "cuSVM: A CUDA implementation of support vector classification and regression". patternsonscreen.net/cuSVMDesc.pdf, pp. 1-9.
[20] Vaněk J., Michálek J., Psutka J. (2017), "A GPU-Architecture Optimized Hierarchical Decomposition Algorithm for Support Vector Machine Training". IEEE Transactions on Parallel and Distributed Systems, 28(12), pp. 3330-3343.
[21] Raina R., Madhavan A., Ng A. Y. (2009), "Large-scale deep unsupervised learning using graphics processors". In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 873-880.
[22] Dean J., Corrado G., Monga R., Chen K., Devin M., Mao M., ..., Ng A. Y. (2012), "Large scale distributed deep networks". In Advances in Neural Information Processing Systems, pp. 1223-1231.
[23] Li Q., Salman R., Kecman V. (2010), "An intelligent system for accelerating parallel SVM classification problems on large datasets using GPU". In Intelligent Systems Design and Applications (ISDA), 2010 10th International Conference on, pp. 1131-1135.
[24] Coates A., Huval B., Wang T., Wu D., Catanzaro B., Andrew N. (2013), "Deep learning with COTS HPC systems". In International Conference on Machine Learning, pp. 1337-1345.
[25] Chen Z., Wang J., He H., Huang X. (2014), "A fast deep learning system using GPU". In Circuits and Systems (ISCAS), 2014 IEEE International Symposium on, pp. 1552-1555.
[26] Hung Y., Wang W. (2012), "Accelerating parallel particle swarm optimization via GPU". Optimization Methods and Software, 27(1), pp. 33-51.
[27] Cai X., Lai G., Lin X. (2013), "Forecasting large scale conditional volatility and covariance using neural network on GPU". The Journal of Supercomputing, 63(2), pp. 490-507.
[28] Tran H. N., Cambria E. (2018), "Ensemble application of ELM and GPU for real-time multimodal sentiment analysis". Memetic Computing, 10(1), pp. 3-13.
[29] Krizhevsky A., Sutskever I., Hinton G. E. (2012), "ImageNet classification with deep convolutional neural networks". In Advances in Neural Information Processing Systems, pp. 1097-1105.
[30] Bonanno F., Capizzi G., Sciuto G. L., Napoli C., Pappalardo G., Tramontana E. (2014), "A novel cloud-distributed toolbox for optimal energy dispatch management from renewables in IGSSs by using WRNN predictors and GPU parallel solutions". In International Symposium on Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM), pp. 1077-1084.
[31] Napoli C., Pappalardo G., Tramontana E., Zappalà G. (2014), "A cloud-distributed GPU architecture for pattern identification in segmented detectors big-data surveys". The Computer Journal, 59(3), pp. 338-352.