=Paper=
{{Paper
|id=Vol-2744/invited2
|storemode=property
|title=MIOpen: An Open Source Library For Deep Learning Primitives (invited paper)
|pdfUrl=https://ceur-ws.org/Vol-2744/invited2.pdf
|volume=Vol-2744
|authors=Jehandad Khan,Paul Fultz,Artem Tamazov,Daniel Lowell,Chao Liu,Michael Melesse,Murali Nandhimandalam,Kamil Nasyrov,Ilya Perminov,Tejash Shah,Vasilii Filippov,Jing Zhang,Jing Zhou,Bragadeesh Natarajan,Mayank Daga
}}
==MIOpen: An Open Source Library For Deep Learning Primitives (invited paper)==
MIOpen: An Open Source Library For Deep Learning Primitives

Jehandad Khan[0000−0003−4479−1871], Paul Fultz[0000−0002−3423−2315], Artem Tamazov[0000−0002−7427−8676], Daniel Lowell[0000−0002−8929−7837], Chao Liu[0000−0002−6943−07919], Michael Melesse[0000−0001−5663−9733], Murali Nandhimandalam[0000−0001−5427−3184], Kamil Nasyrov[0000−0003−2026−7100], Ilya Perminov[0000−0003−0486−5821], Tejash Shah[0000−0003−0354−5674], Vasilii Filippov[0000−0003−0559−0380], Jing Zhang[0000−0001−8114−1080], Jing Zhou[0000−0002−4294−3985], Bragadeesh Natarajan[0000−0001−6848−0694], and Mayank Daga[0000−0002−2637−302X]

AMD Inc. Mayank.Daga@amd.com

Abstract. Deep learning has established itself as a common occurrence in the business lexicon. The unprecedented success of deep learning in recent years can be attributed to an abundance of data, the availability of gargantuan compute capabilities offered by GPUs, and the adoption of an open-source philosophy by researchers and industry. Deep neural networks can be decomposed into a series of different operators. MIOpen, AMD's open-source deep learning primitives library for GPUs, provides highly optimized implementations of such operators, shielding researchers from internal implementation details and hence accelerating the time to discovery. This paper introduces MIOpen and provides details about the internal workings of the library and its supported features. MIOpen innovates on several fronts, such as implementing fusion to optimize for memory bandwidth and GPU launch overheads, providing an auto-tuning infrastructure to overcome the large design space of problem configurations, and implementing different algorithms to optimize convolutions for different filter and input sizes. MIOpen is one of the first libraries to publicly support the bfloat16 data-type for convolutions, allowing efficient training at lower precision without loss of accuracy.

Keywords: Convolution, Deep Learning, GPU, HIP, Machine Learning, MIOpen, OpenCL®, Performance.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Deep learning has burgeoned into one of the most important technological breakthroughs of the 21st century. The use of deep learning has garnered immense success in applications like image and speech recognition, recommendation systems, and language translation. This in turn advances fields like autonomous driving and disease diagnosis. GPUs have played a critical role in the advancement of deep learning. The massively parallel computational power of GPUs has been influential in reducing the training time of complex deep learning models, hence accelerating the time to discovery [29]. The availability of open-source frameworks like TensorFlow and PyTorch is another cornerstone of the fast-paced innovation in deep learning [1, 24].

Deep learning frameworks decompose models into either a computational graph or a sequence of operations [12, 22]. These high-level operations are then compiled down to a series of hardware-specific high-performance primitives. These primitives are to deep learning what BLAS (Basic Linear Algebra Subprograms) [18] is to linear algebra and high-performance computing.
The availability of a library that provides highly optimized implementations of such primitives enables deep learning researchers to focus on their science and leaves the burden of developing such primitives to the hardware vendors. The library provides a simple, callable application programming interface (API) that enables seamless integration with client libraries while remaining flexible enough that new features can be added easily.

MIOpen is AMD's deep learning primitives library, which provides highly optimized and hand-tuned implementations of different operators such as convolution, batch normalization, pooling, softmax, activation, and layers for Recurrent Neural Networks (RNNs), used in both training and inference [9]. Moreover, MIOpen is fully open-source, including all its GPU kernels, complementing AMD's open-source ROCm stack [4]. MIOpen is the first to extend the open-source advantage into GPU vendor libraries, thereby continuing the same ethos as the deep learning community.

As deep learning has gained critical acclaim over the years, substantial research has been conducted to accelerate it. One optimization technique, called fusion, has been recognized to be more potent than others [19]. Fusion collapses several neural network layers into fewer kernels, thereby optimizing 1) memory bandwidth requirements, by requiring less data to be moved between host and GPU memories, and 2) GPU kernel launch overheads, by launching fewer GPU kernels compared to the vanilla, non-fused neural network. Aside from discrete primitives, MIOpen also offers a fusion API which allows the frameworks to fuse some of the operations mentioned above. MIOpen fusion can be used to accelerate both convolutional and recurrent neural networks.

Another area that has flourished with the popularity of deep learning is open-source graph compilers [6, 16, 19, 26]. Graph compilers further the relevance of deep learning to widespread applications by generating the implementations of the aforementioned operators instead of relying on hardware-specific libraries. However, generating high-performance implementations of two operators, convolution and GEMM, is extremely cumbersome without intimate knowledge of the underlying hardware. Therefore, graph compilers rely on libraries like MIOpen for these operators. MIOpen's open-source nature enables a plethora of optimization opportunities which were not possible before, for example, fusing an operator generated by the compiler with MIOpen's convolutions. MIOpen facilitates these optimizations by breaking down complex operators like convolutions into several simple and small operators and providing high-performance implementations of these simple operators to the graph compiler. This MIOpen feature is called composable kernels.

The primary aim of MIOpen is to provide access to high-performance kernels, support several data-types, and support as many hardware targets as required. To that end, MIOpen supports four different data-types: float32, float16, bfloat16, and int8, and two programming models: OpenCL® [23] and HIP. The kernels in MIOpen are backed by both high-level-language and hand-tuned assembly implementations. MIOpen also provides an auto-tuning infrastructure to achieve maximum performance in the user's hardware and software environment.
This document provides an under-the-hood look at the MIOpen library, giving detailed information about its functionality and introducing MIOpen's capabilities to users and developers. The rest of the paper is organized as follows: Section 2 describes prior work. Section 3 describes the overall design philosophy of the library and provides details about kernel compilation, the abstractions used to localize those details in the library, the tuning infrastructure for improving kernel performance, and MIOpen's support for OpenCL® and HIP. Section 4 provides details about the supported operations, primarily the convolution operation. Section 5 describes MIOpen's fusion API for merging different operations for increased performance; this is followed by some usage statistics and performance comparisons in Section 6. Section 7 presents conclusions and future work.

2 Related Work

Developing hardware-optimized libraries for the most critical and time-sensitive operations is a well-known practice. For linear algebra, such libraries are known as BLAS (Basic Linear Algebra Subprograms) and have different implementations for different systems [5, 18, 28]. In a similar spirit, different deep learning libraries have been written to make it easier for client applications to implement deep learning primitives. Alex Krizhevsky's cuda-convnet is one of the initial libraries to implement convolutions and inspired many others [11, 27]. Chetlur et al. developed cuDNN, a deep neural network library for NVIDIA GPUs [7]. MIOpen falls in this category, since it provides a C programming language based API for deep learning primitives. While these libraries aim to accelerate deep learning primitives on GPUs, research has also been conducted to improve the performance of inference-only workloads on CPUs, for example MKL-DNN [11].

Most of the above-mentioned libraries focus on lower-level optimization opportunities. An orthogonal approach is to abstract this detail behind a domain-specific language (DSL). This technique has already been applied successfully to other domains such as computer vision and linear algebra [13, 14, 25]. Vasilache et al. developed Tensor Comprehensions, which takes a similar approach and designs a language that can infer tensor dimensions and summation indices automatically [27]. However, such an approach makes it complicated to support a wide array of platforms and hardware targets, as is required of MIOpen.

2.1 MIOpen and higher-level frameworks

The above libraries are augmented by a community of frameworks which enable researchers and practitioners to express their computation pipeline using a host language (typically Python™ or some other higher-level language) [1, 12]. These frameworks in turn call out to libraries such as MIOpen for efficient implementations of the primitives required to realize the computation in those graphs. Frameworks strive to support a wide array of hardware and applications; for instance, both TensorFlow and PyTorch already support MIOpen as a backend aimed at AMD GPUs. Thus a user can seamlessly change the hardware target without changing their application code.

3 Overall Design

This section describes MIOpen's design philosophy, using the convolution operation as an example.
3.1 Kernels and Solvers

Mapping a problem description to a particular kernel requires MIOpen to determine the file which contains the required GPU kernel, the name of the kernel in that file, and the compiler arguments required to compile it. Typically, there is more than one kernel which can perform similar operations; however, each kernel has a unique set of constraints and may deliver different performance due to differing code optimizations and the input dimensions of the problem. All this information is grouped in MIOpen classes collectively called solvers. Together, these classes solve for the best convolution kernel given a problem description. This construct creates a layer of abstraction between the rest of the MIOpen library and kernel-specific details, so all the details of a kernel are completely localized. If a developer wishes to add a new kernel to MIOpen, all that is required is to add the source code for the kernel and implement the associated solver; thereafter the kernel may be selected automatically.

3.2 Auto-tuning infrastructure

In general, any high-performance code leverages auto-tuning for choosing the parameters that may change with the underlying architecture as well as the problem description, thereby impacting performance. MIOpen is no exception to this rule. This requires that all tunable kernels be tuned for known configurations to achieve maximum performance. Once known, these tuning parameters can be shipped with MIOpen, or the user may employ the same infrastructure to tune MIOpen kernels for custom configurations.

A solver encapsulates the constraints for the tuning parameters as well as the interface machinery to launch tuning instances. The tuning parameters create a grid of possible values of the kernel tuning parameters, and the tuning infrastructure compiles and launches a unique kernel for each of these combinations using a pruned search-space approach. Once a kernel is tuned and the optimum tuning parameters are known, they are serialized to a designated directory on the user's system for future retrieval.

3.3 Kernel compilation and caching

Launching a kernel requires setting up the compilation parameters and invoking a device-code compiler to generate the binary object. MIOpen device-code consists of kernels written in OpenCL®, HIP [3], and GCN assembly [2], which may be compiled using clang [8]. Since compiling a kernel is a costly and time-consuming procedure, MIOpen employs two levels of caching to improve the runtime performance of the library. This design choice is tightly coupled with how device-code compilers compile and load compute kernels from the binaries. Once a kernel file is compiled, it is cached to disk to avoid future compilations of the same source file with the same parameters. Due to these caching effects, it is recommended that the user's application perform a warmup iteration so that MIOpen's different caches can be populated. Such runs ensure that subsequent network invocations are accurately timed, without the effects of disk I/O or compilation delays. This limitation is not unique to MIOpen and is also applicable to other high-performance libraries.
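To make the solver abstraction of Sections 3.1 and 3.2 concrete, the sketch below shows, in C, how a solver couples an applicability check with kernel-construction details and a set of tunable parameters. This is a simplified illustration only: the type and field names (ProblemDesc, KernelInfo, Solver) are hypothetical and do not correspond to MIOpen's actual internal C++ classes.

<pre>
#include <stdbool.h>

/* Hypothetical problem description: one convolution configuration. */
typedef struct {
    int n, c, h, w;              /* input batch, channels, height, width */
    int k, fy, fx;               /* output channels, filter height/width */
    int pad_h, pad_w, stride_h, stride_w;
} ProblemDesc;

/* Everything needed to compile and launch one kernel. */
typedef struct {
    const char *kernel_file;     /* source file containing the kernel   */
    const char *kernel_name;     /* entry point within that file        */
    const char *build_options;   /* compiler arguments, incl. tuning    */
} KernelInfo;

/* A solver: an applicability predicate plus a kernel constructor.
 * Tuning walks the grid of valid parameter values for a solver and
 * benchmarks the kernel produced at each grid point. */
typedef struct {
    const char *name;
    bool (*is_applicable)(const ProblemDesc *p);
    KernelInfo (*get_solution)(const ProblemDesc *p,
                               const int *tuning_params);
} Solver;
</pre>

Kernel selection then reduces to iterating over the registered solvers, skipping those whose applicability predicate rejects the problem, and benchmarking or looking up the remainder, so all kernel-specific details stay localized inside each solver.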
3.4 HIP and OpenCL® backends

MIOpen supports applications that use the OpenCL® and HIP programming models [3]. All the APIs remain consistent from the client application's perspective; the only difference is in the creation of the miopenHandle structure, which is created with either a HIP stream or an OpenCL® device context. Internally, the HIP backend compiles each kernel using an appropriate compiler depending on the kernel source type. Subsequently, the compiled binary object is loaded and passed off to the runtime for execution.

4 Machine Learning Primitives

4.1 Convolution

Most modern neural networks employ convolution as a central operation. Its usefulness and popularity make it a critical piece of the machine learning puzzle, particularly in image processing. The numerical complexity of the convolution operation and its diverse set of inputs make it difficult to generalize across multiple hardware architectures. Different algorithms have been proposed to compute the convolution of a filter and an image; among them, MIOpen provides implementations of the Winograd algorithm [17], a direct algorithm, an algorithm based on the matrix-matrix multiplication (GEMM) operation [12], and the Fast Fourier Transform. The best-performing algorithm for a given set of input and filter dimensions is rarely readily apparent on a given architecture.

To assess the relative performance of these kernels and return the best-performing one, MIOpen employs a find step before the actual convolution operation. For this step, the user constructs the necessary data structures for the input/output image tensors as well as the convolution descriptor specifying the properties of the convolution, such as striding, dilation, and padding. The user then calls the MIOpen convolution Find API, which allows MIOpen to benchmark all the applicable kernels for the given problem configuration; this information is returned in an array of type miopenConvAlgoPerf_t. This enables the library to adjust for any variations in the user's hardware and also allows the user to balance the trade-off between execution time and the additional memory that may be required by some algorithms.
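The following minimal sketch shows this find-then-run flow, based on the MIOpen documentation [20]. Error checking is omitted, the tensor shapes are arbitrary, the device buffers d_x, d_w, d_y, and d_ws are assumed to have been allocated beforehand (e.g., with hipMalloc), and the workspace size is assumed to have been obtained from miopenConvolutionForwardGetWorkSpaceSize; exact signatures should be verified against the installed MIOpen version.

<pre>
#include <miopen/miopen.h>

/* Assumes d_x, d_w, d_y, d_ws are device buffers allocated elsewhere,
 * and ws_size was queried via miopenConvolutionForwardGetWorkSpaceSize. */
void run_conv(void *d_x, void *d_w, void *d_y, void *d_ws, size_t ws_size)
{
    miopenHandle_t handle;
    miopenCreate(&handle);

    miopenTensorDescriptor_t xDesc, wDesc, yDesc;
    miopenCreateTensorDescriptor(&xDesc);
    miopenCreateTensorDescriptor(&wDesc);
    miopenCreateTensorDescriptor(&yDesc);
    miopenSet4dTensorDescriptor(xDesc, miopenFloat, 16, 64, 56, 56);
    miopenSet4dTensorDescriptor(wDesc, miopenFloat, 64, 64, 3, 3);
    miopenSet4dTensorDescriptor(yDesc, miopenFloat, 16, 64, 56, 56);

    /* Convolution descriptor: padding 1, stride 1, dilation 1. */
    miopenConvolutionDescriptor_t convDesc;
    miopenCreateConvolutionDescriptor(&convDesc);
    miopenInitConvolutionDescriptor(convDesc, miopenConvolution,
                                    1, 1, 1, 1, 1, 1);

    /* Find step: benchmark applicable kernels for this configuration. */
    miopenConvAlgoPerf_t perf[4];
    int returned = 0;
    miopenFindConvolutionForwardAlgorithm(
        handle, xDesc, d_x, wDesc, d_w, convDesc, yDesc, d_y,
        4, &returned, perf, d_ws, ws_size, /*exhaustiveSearch=*/0);

    /* perf[0] holds the fastest algorithm; run the real convolution. */
    float alpha = 1.0f, beta = 0.0f;
    miopenConvolutionForward(handle, &alpha, xDesc, d_x, wDesc, d_w,
                             convDesc, perf[0].fwd_algo, &beta,
                             yDesc, d_y, d_ws, ws_size);
}
</pre>

Each miopenConvAlgoPerf_t entry also reports the measured time and required workspace memory of an algorithm, which is what lets the caller trade execution speed against additional memory.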
Types of Supported Convolutions

Transpose convolution. Transposed convolution (also known as deconvolution or fractionally-strided convolution) is an operation typically used to increase the size of the tensor resulting from a convolution. The standard convolution operation reduces the size of the image, which is desirable in classification tasks. However, tasks such as image segmentation require the output tensor to have the same size as the input. MIOpen supports the transpose convolution required by such networks; it may be enabled by setting the miopenConvolutionMode_t in the miopenConvolutionDescriptor_t to miopenTranspose.

Depthwise convolution. In depthwise convolution, the input is separated along the depth (channels) and then convolved with a filter that is also separated along the same axis; the results are stacked into a tensor. Separating the process of finding spatial correlations from that of finding cross-channel correlations results in fewer parameters compared to regular convolution. Smaller and more efficient neural networks with depthwise separable convolutions have applications in training on embedded systems such as mobile phones.

Grouped convolution. Grouped convolutions were introduced in AlexNet [15] to reduce the memory required for the convolution operation. Grouped convolutions are able to achieve accuracy similar to non-grouped convolutions while having fewer parameters. Further details may be found in [15]. The function miopenSetConvolutionGroupCount may be used to set the group count for a groupwise convolution. To perform a depthwise convolution, use the same function to set the group count to the number of channels [10].

Composable Kernels. The different variations of the convolution operation discussed above, as well as the variety of algorithms that may be used to implement them, make it difficult to develop efficient kernels. One solution to tackle this complexity is to break these operations down into reusable modules that can be used universally by implementations of different algorithms, and to express a kernel as a composition of these modules. Development work then falls into one of the following categories: 1) describe an algorithm with a hardware-agnostic expression, 2) decide how to map the hardware-agnostic expressions onto hardware-dependent modules, and 3) implement and optimize the hardware-dependent modules for specific hardware. Breaking these primitives down into smaller modules opens new doors to optimizations that fuse such modules together. This kernel programming model is referred to in MIOpen as composable kernels. MIOpen v2.0 includes an implementation of the implicit GEMM convolution algorithm using the composable-kernel programming approach. Further details about this novel programming paradigm will be published in the future.

4.2 Batch Normalization

Batch normalization is a very successful technique for accelerating deep neural network training. MIOpen supports two versions of batch normalization: per-activation and spatial. Per-activation batch normalization is typically positioned after a fully connected layer in a network. Batch normalization for convolution layers is termed spatial in that it learns separate scaling (γ) and bias (β) parameters for each channel; the resulting transform is applied to all the activations in a single feature map. MIOpen supports the batch normalization operation for both training and inference. All these functions accept a mode parameter from the miopenBatchNormMode_t enum, which has two values: miopenBNPerActivation, which performs element-wise normalization for a fully connected layer, and miopenBNSpatial, which performs normalization for convolution layers. For more information see [20] and [21].

4.3 Recurrent Neural Networks

MIOpen supports the three RNN types prevalent in industry and research: vanilla RNN, LSTM, and GRU, and two kinds of activation function for the hidden state of the vanilla RNN neuron: Rectified Linear Unit (ReLU) and hyperbolic tangent (Tanh). Furthermore, information may flow through the RNN in the forward direction only (unidirectional RNNs) or in both the forward and backward directions (bi-directional RNNs); MIOpen supports all three RNN types in both the unidirectional mode miopenRNNunidirection and the bidirectional mode miopenRNNbidirection. Some RNN layers take input sequences directly from the output of a previous layer, while others require a transform to align the intermediate vector dimensions or simply to achieve better results. MIOpen satisfies this requirement by supporting two input types: 1) miopenRNNlinear, which performs a linear transform before feeding the input to the neuron, and 2) miopenRNNskip, which allows a direct input into the neuron. Similarly, bias may be added to or removed from the network by choosing the mode miopenRNNwithBias or miopenRNNNoBias.
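As a brief illustration of these knobs, the sketch below configures a two-layer bidirectional LSTM descriptor with a linear input transform and bias, following the MIOpen documentation [20]; exact enum spellings and the miopenSetRNNDescriptor signature may vary between MIOpen versions.

<pre>
#include <miopen/miopen.h>

/* Configure a 2-layer bidirectional LSTM with hidden size 512. */
void make_rnn(miopenRNNDescriptor_t *rnnDesc)
{
    miopenCreateRNNDescriptor(rnnDesc);
    miopenSetRNNDescriptor(*rnnDesc,
                           512,                   /* hidden state size        */
                           2,                     /* number of stacked layers */
                           miopenRNNlinear,       /* linear input transform   */
                           miopenRNNbidirection,  /* forward + backward flow  */
                           miopenLSTM,            /* vanilla RNN / LSTM / GRU */
                           miopenRNNwithBias,     /* or miopenRNNNoBias       */
                           miopenRNNdefault,      /* algorithm variant        */
                           miopenFloat);          /* data type                */
}
</pre>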
The dependence of the current state on the previous state, as well as the variety of RNN configurations, makes it difficult to achieve high computational efficiency on a GPU platform. Prevalent frameworks such as TensorFlow encapsulate the state-updating functions of the RNN neuron in a cell format to achieve better compatibility across different modes, though the impact of data layouts and computation procedures on performance is neglected. MIOpen handles RNN computation by taking advantage of two powerful ROCm-platform GEMM libraries: (1) rocBLAS for the HIP backend, and (2) MIOpenGEMM for the OpenCL® backend, which are augmented by specialized MIOpen kernels for other primitive functions. MIOpen achieves high computational efficiency for RNNs by batching together different time steps and performing them as a single GEMM operation. This is made possible by the independence of the input vectors at different time points.

In addition to the operations mentioned above, other operations required to support popular neural network architectures are also supported by MIOpen.

5 Fusion API

Most neural networks are data-flow graphs in which data flows in one direction and is operated upon as it moves from one layer to another. While conceptually data flows in only one direction, the underlying kernels implementing these operations have to read data from global memory, operate on it, and then write the result back for layers further down the pipeline. This is necessary due to the limited on-chip memory of GPUs, given the large image and filter sizes in neural network architectures. However, not all operations require that data be read from and written back to global memory each time; that is, some operations may be fused to increase the compute efficiency of these kernels. This merging of operations into a single kernel may be termed kernel fusion.

As a simple example, consider an addition operation followed by a rectified linear unit (ReLU) operation. In this case, the intermediate result need not be written back to main memory, and both operations may be performed while the individual data elements are in on-chip memory. Another common sequence of operations is convolution followed by a bias (addition) and a ReLU operation. It must be kept in mind that fusions of other operators are much more involved, such as the fusion of the convolution and batch normalization operations.

The MIOpen library offers the fusion API to facilitate the efficient fusion of such operations; it allows the user to specify a sequence of operations that are to be fused. Once the user specifies this sequence, MIOpen decides on the applicable kernel and compiles it; all this information is encapsulated in the miopenFusionPlanDescriptor data structure [21]. If merging the required fusion sequence is feasible, the compilation step of the fusion plan returns success; thereafter the user supplies the runtime arguments for the kernels, such as the parameters for the different operations, and then executes the fusion plan with data pointers for the input and output data. The advantage of separating the compilation step from the argument structure is that a fusion plan which has been compiled once need not be compiled again for different input values. Further details and example code can be found at [20].
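A condensed sketch of the convolution + bias + ReLU fusion flow described above, adapted from the patterns in the MIOpen documentation [20, 21]. Error handling is minimal, the descriptors and device pointers are assumed to be set up by the caller, and signatures should be checked against the installed MIOpen version.

<pre>
#include <miopen/miopen.h>

/* Fuse convolution + bias + ReLU into one kernel via the fusion API.
 * Handle, descriptors, and device pointers are assumed set up already. */
void fused_conv_bias_relu(miopenHandle_t handle,
                          miopenTensorDescriptor_t inDesc, void *in,
                          miopenTensorDescriptor_t wDesc,  void *wts,
                          miopenTensorDescriptor_t bDesc,  void *bias,
                          miopenConvolutionDescriptor_t convDesc,
                          miopenTensorDescriptor_t outDesc, void *out)
{
    miopenFusionPlanDescriptor_t plan;
    miopenFusionOpDescriptor_t convOp, biasOp, activOp;

    /* 1. Describe the sequence of operations to fuse. */
    miopenCreateFusionPlan(&plan, miopenVerticalFusion, inDesc);
    miopenCreateOpConvForward(plan, &convOp, convDesc, wDesc);
    miopenCreateOpBiasForward(plan, &biasOp, bDesc);
    miopenCreateOpActivationForward(plan, &activOp, miopenActivationRELU);

    /* 2. Compile once; succeeds only if the sequence can be fused. */
    if (miopenCompileFusionPlan(handle, plan) != miopenStatusSuccess)
        return; /* fall back to discrete kernels */

    /* 3. Supply per-invocation arguments; the compiled plan is reused. */
    float alpha = 1.0f, beta = 0.0f;
    miopenOperatorArgs_t args;
    miopenCreateOperatorArgs(&args);
    miopenSetOpArgsConvForward(args, convOp, &alpha, &beta, wts);
    miopenSetOpArgsBiasForward(args, biasOp, &alpha, &beta, bias);
    miopenSetOpArgsActivForward(args, activOp, &alpha, &beta,
                                /*activAlpha=*/0.0, /*activBeta=*/0.0,
                                /*activGamma=*/0.0);

    /* 4. Execute the fused kernel. */
    miopenExecuteFusionPlan(handle, plan, inDesc, in, outDesc, out, args);
}
</pre>

Because the compiled plan is independent of the operator arguments, steps 3 and 4 can be repeated for new inputs without recompilation.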
6 Results

This section highlights the performance improvements that MIOpen is able to offer, particularly in convolution, as well as in some supported fusions. To date, the primary beneficiaries of machine learning progress have been machine vision and natural language processing. In machine vision, the convolution operation is the primary workhorse due to the low number of parameters required to learn, as compared to regular neural networks, as well as its favorable mathematical properties. However, the parameters associated with the convolution operations in different deep convolutional neural networks have changed considerably. Early CNNs employed larger filter sizes to reduce the height and width of the feature maps and simultaneously increase the number of feature maps. For instance, LeNet employed filters of size 5 × 5, while AlexNet [15] contained filters of size 5 × 5 as well as 11 × 11. Recent networks, however, have relied almost exclusively on smaller filter sizes, namely 1 × 1 and 3 × 3, coupled with striding to reduce the size of the feature map.

Figure 1 shows the relative speedup of different convolution configurations as compared to MIOpen's im2col+GEMM implementation. The configurations shown therein have been selected randomly from popular image-classification networks such as GoogLeNet, Inception v3, and Inception v4. The y-axis in Figure 1 shows the log of the speedup obtained by MIOpen, while the x-axis shows the labels for the different configurations. Each label shows, respectively, the filter height, filter width, input channels, image height, image width, output channels, padding (height), and padding (width), separated by hyphens (-).

Figures 1a, 1c, and 1e depict the performance gains for kernels with filter height and width equal to 1 (1 × 1 convolutions) in the forward, backward-data, and backward-weights directions respectively. While mathematically 1 × 1 convolutions may be described as a pure GEMM operation, MIOpen may still provide a substantial performance benefit in certain cases. Similarly, Figures 1b, 1d, and 1f show the performance benefit attained for non-1 × 1 kernels in the forward, backward-data, and backward-weights directions respectively.

As mentioned in Section 4, MIOpen employs the Winograd algorithm for applicable convolutions, while the 1 × 1 convolutions are primarily serviced by kernels written in GCN ISA. Due to the efficiency of the Winograd algorithm, MIOpen can speed up many 3 × 3 convolutions; on larger filter sizes, however, it is not as effective due to granularity loss. In those cases MIOpen's other convolution kernels step in to provide a speedup, though in some instances this speedup is not substantial. The MIOpen team is continuously working on new algorithms to improve performance in these areas.

7 Conclusions and Future Work

This paper identified some of the challenges faced by a high-performance computing library and presented some of the mechanisms implemented in MIOpen to address them. The open-source nature of MIOpen makes it easy for researchers and academics to experiment and implement novel solutions to these problems; the authors look forward to constructive feedback from the community.

Acknowledgements

The MIOpen team would like to gratefully acknowledge the valuable contributions of Alex Lyashevsky, James Newling, and the GitHub user ghostplant, as well as the support of the open-source community.
Fig. 1. Relative performance improvement for different convolution configurations as compared to im2col+GEMM: (a) 1 × 1 filter size (forward), (b) non-1 × 1 filter sizes (forward), (c) 1 × 1 filter size (backward data), (d) non-1 × 1 filter sizes (backward data), (e) 1 × 1 filter size (backward weights), (f) non-1 × 1 filter sizes (backward weights).

References

1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: A system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). pp. 265–283 (2016)
2. AMD GCN ISA. https://developer.amd.com/resources/developer-guides-manuals, last accessed 2020/07/15
3. AMD HIP. https://github.com/ROCm-Developer-Tools/HIP, last accessed 2020/08/13
4. AMD Inc: ROCm - Open Source Platform for HPC and Ultrascale GPU Computing. https://github.com/ROCmSoftwarePlatform, last accessed 2020/08/14
5. Belter, G., Jessup, E.R., Karlin, I., Siek, J.G.: Automating the generation of composed linear algebra kernels. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. p. 59. ACM (2009)
6. Chen, T., Moreau, T., Jiang, Z., Shen, H., Yan, E.Q., Wang, L., Hu, Y., Ceze, L., Guestrin, C., Krishnamurthy, A.: TVM: End-to-end optimization stack for deep learning. arXiv preprint arXiv:1802.04799, pp. 1–15 (2018)
7. Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., Shelhamer, E.: cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, pp. 1–9 (2014)
8. Clang: a C language family frontend for LLVM. http://clang.llvm.org/, last accessed 2020/07/18
9. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
10. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
11. Intel MKL-DNN. https://software.intel.com/en-us/articles/introducing-dnn-primitives-in-intelr-mkl
12. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
13. Kjolstad, F., Kamil, S., Chou, S., Lugato, D., Amarasinghe, S.: The tensor algebra compiler. Proceedings of the ACM on Programming Languages 1(OOPSLA), 77 (2017)
14. Kjolstad, F., Kamil, S., Ragan-Kelley, J., Levin, D.I., Sueda, S., Chen, D., Vouga, E., Kaufman, D.M., Kanwar, G., Matusik, W., et al.: Simit: A language for physical simulation. ACM Transactions on Graphics (TOG) 35(2), 20 (2016)
15. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
16. Lattner, C., Pienaar, J.: MLIR Primer: A compiler infrastructure for the end of Moore's Law. Google Research (2019)
17. Lavin, A., Gray, S.: Fast algorithms for convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4013–4021 (2016)
18. Lawson, C.L., Hanson, R.J., Kincaid, D.R., Krogh, F.T.: Basic linear algebra subprograms for Fortran usage. ACM Transactions on Mathematical Software 5(3), 308–323 (1979)
19. Leary, C., Wang, T.: XLA: TensorFlow, compiled. TensorFlow Dev Summit (2017)
20. MIOpen: Documentation. https://rocmsoftwareplatform.github.io/MIOpen/doc/html, last accessed 2020/08/14
21. MIOpen: AMD's library for high performance machine learning primitives. https://github.com/ROCmSoftwarePlatform/MIOpen, last accessed 2020/05/15
22. MLIR: Multi-Level Intermediate Representation for Compiler Infrastructure. https://github.com/tensorflow/mlir, last accessed 2020/09/03
23. Munshi, A.: The OpenCL specification. In: Hot Chips 21 Symposium (HCS), 2009 IEEE. pp. 1–314. IEEE (2009)
24. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems. pp. 8026–8037 (2019)
25. Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., Amarasinghe, S.: Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices 48(6), 519–530 (2013)
26. Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., et al.: Glow: Graph lowering compiler techniques for neural networks. arXiv preprint arXiv:1805.00907 (2018)
27. Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W.S., Verdoolaege, S., Adams, A., Cohen, A.: Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730 (2018)
28. Whaley, R.C., Dongarra, J.J.: Automatically tuned linear algebra software. In: SC'98: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing. pp. 38–38. IEEE (1998)
29. You, Y., Zhang, Z., Hsieh, C.J., Demmel, J., Keutzer, K.: ImageNet training in minutes. In: Proceedings of the 47th International Conference on Parallel Processing. p. 1. ACM (2018)