<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>High Performance Third-order Tensor-Train and Tensor-Ring Decompositions on GPUs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hao Hong</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Weiqin Tong</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tao Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaoyang Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Columbia University</institution>
          ,
          <addr-line>116th St &amp; Broadway, New York, 10027</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Shanghai University</institution>
          ,
          <addr-line>No.99 Shangda Road BaoShan District, Shanghai, 200444</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <fpage>85</fpage>
      <lpage>91</lpage>
      <abstract>
        <p>Tensor decomposition is an essential tool for analyzing data in many fields, such as sociology, financial encryption, and signal processing. Because of the “curse of dimensionality,” the time and space costs of tensor decomposition increase quickly with the tensor size. This paper proposes high-performance GPU-based implementations of the tensor-train (TT) and tensor-ring (TR) decompositions. First, we replace the traditional SVD with the highly parallel Jacobi-based singular value decomposition (SVD) to match the GPU architecture. Second, we design a high-performance matrix multiplication on the GPU. Third, by analyzing the data storage, we propose an optimized memory access scheme to reduce the memory footprint. Moreover, we conduct experiments on a V100 GPU to verify the performance of our algorithms. Our optimized GPU-based TT and TR decomposition implementations achieve up to 6.67× and 6.36× speedups over the basic implementations.</p>
      </abstract>
      <kwd-group>
        <kwd>TT decomposition</kwd>
        <kwd>TR decomposition</kwd>
        <kwd>GPU</kwd>
        <kwd>Jacobi SVD</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Tensor-Train and Tensor-Ring Decompositions</title>
      <p>This section describes notations, TT decomposition, and TR decomposition.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Operations and Notations</title>
      <p>This paper uses boldface lowercase letters (e.g., a ∈ ℝ^n), boldface uppercase letters (e.g., A ∈ ℝ^(n1×n2)), and uppercase calligraphic letters (e.g., 𝒜 ∈ ℝ^(n1×n2×n3)) to denote vectors, matrices, and tensors, respectively.</p>
      <p>Tensor contraction is represented by the ∘ symbol.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Tensor-Train Decomposition</title>
      <p>r0 = r3 = 1. Therefore,  (1) and  (3) are second-order tensors (matrices).
where  ( ) ∈ Rrk−1×nk×rk expresses the k-th core tensor. [r0, r1, r2, r3] expresses the TT-ranks where</p>
      <p>The tensor-train structure is one of the tensor networks and is represented in Figure 1 by the graphical
modeling [7]. The connections between two tensors indicate tensor contractions. The original tensor 
is obtained by the tensor contractions of all tensors on the left. The steps of third-order TT
decomposition [9] are described in Algorithm 1. In the third and tenth lines of Algorithm 1, we convert
a tensor  ( −1) into a matrix  with rk−1nk rows and ∏i3=k+1 ni columns and a matrix  into a tensor
 (k) with nk columns, rk in the third direction, and rk−1 rows and by reshaping operations, respectively.</p>
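      <p>As a concrete illustration of the first reshaping (line 3 with k = 1 and r0 = 1), the following sketch assumes a column-major layout like that of the MATLAB reference code [9] and shows that the tensor-to-matrix conversion is only an index re-interpretation; the function names are illustrative, not from the paper.</p>
      <preformat><![CDATA[
#include <cstddef>

// Linear offset of tensor element (i1, i2, i3), zero-based, column-major.
size_t tensorOffset(int i1, int i2, int i3, int n1, int n2) {
    return (size_t)i1 + (size_t)n1 * ((size_t)i2 + (size_t)n2 * i3);
}

// Linear offset of entry (row, col) in the n1-by-(n2*n3) unfolding matrix.
size_t matrixOffset(int row, int col, int n1) {
    return (size_t)row + (size_t)n1 * col;
}

// With row = i1 and col = i2 + n2*i3 the two offsets coincide, so the
// reshape in line 3 of Algorithm 1 needs no data movement.
      ]]></preformat>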
    </sec>
    <sec id="sec-5">
      <title>2.3 Tensor-Ring Decomposition</title>
      <p>In the TR decomposition, a third-order tensor 𝒳 ∈ ℝ^(n1×n2×n3) is likewise factorized into three core tensors 𝒢^(k) ∈ ℝ^(r_{k−1}×n_k×r_k), 𝒳 = 𝒢^(1) ∘ 𝒢^(2) ∘ 𝒢^(3), where tensor contractions are calculated between adjacent tensors and also between 𝒢^(1) and 𝒢^(3). [r0, r1, r2, r3] expresses the TR-ranks. Because of the ring structure of TR tensors, r0 = r3 is not forced to equal 1, which distinguishes the TR structure from the TT structure. The TR structure is another special case of tensor networks, and the steps of the TR decomposition [10] are described in Algorithm 2.</p>
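      <p>To make the extra contraction between 𝒢^(1) and 𝒢^(3) concrete, the following CPU sketch reconstructs a single tensor entry from three TR cores by multiplying the corresponding lateral slices and closing the ring with a trace; the Core structure and function names are illustrative assumptions and are not part of Algorithm 2.</p>
      <preformat><![CDATA[
#include <cstddef>
#include <vector>

// Core k holds a dense r_in x n x r_out array; the n lateral slices are
// stored back to back, each slice row-major (r_in rows, r_out columns).
struct Core {
    int rin, n, rout;
    std::vector<float> data;  // size rin * n * rout
    const float* slice(int i) const { return data.data() + (size_t)i * rin * rout; }
};

// One entry of the TR reconstruction: trace(G1(:,i1,:) * G2(:,i2,:) * G3(:,i3,:)).
// The trace is the contraction between the last core and the first one (r3 = r0).
float trEntry(const Core& g1, const Core& g2, const Core& g3,
              int i1, int i2, int i3) {
    // M = G1(:, i1, :) * G2(:, i2, :), an r0-by-r2 matrix.
    std::vector<float> m((size_t)g1.rin * g2.rout, 0.0f);
    for (int a = 0; a < g1.rin; ++a)
        for (int b = 0; b < g1.rout; ++b)
            for (int c = 0; c < g2.rout; ++c)
                m[(size_t)a * g2.rout + c] +=
                    g1.slice(i1)[(size_t)a * g1.rout + b] *
                    g2.slice(i2)[(size_t)b * g2.rout + c];
    // trace(M * G3(:, i3, :)) closes the ring.
    float val = 0.0f;
    for (int a = 0; a < g1.rin; ++a)
        for (int c = 0; c < g3.rin; ++c)
            val += m[(size_t)a * g2.rout + c] *
                   g3.slice(i3)[(size_t)c * g3.rout + a];
    return val;
}
      ]]></preformat>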
    </sec>
    <sec id="sec-6">
      <title>3. High-performance Third-order</title>
    </sec>
    <sec id="sec-7">
      <title>Decompositions on GPUs</title>
    </sec>
    <sec id="sec-8">
      <title>3.1 Parallelization Schemes</title>
    </sec>
    <sec id="sec-9">
      <title>Tensor-Train and</title>
    </sec>
    <sec id="sec-10">
      <title>Tensor-Ring</title>
      <p>In this section, we propose three parallel optimizations for TT decomposition in Algorithm 1 and
TR decomposition in Algorithm 2.</p>
      <p>Jacobi SVD in Parallel</p>
      <p>In the TT and TR decompositions, the matrix SVD operations take up the most time, reaching 67% of the total running time. The traditional SVD operation does not match the GPU architecture because of its low parallelism. As a substitute, the Jacobi SVD [8] is adopted to exploit the GPU's high parallelism. The Jacobi SVD needs a maximum iteration number and an accuracy threshold to determine when the algorithm terminates. With single-precision data and computation, this paper sets the maximum number of iterations to 100 and the accuracy to 10^−8 to obtain the minimum error in experiments.</p>
      <p>The algorithm stops when the number of iterations reaches the maximum, or when the error between the reconstructed matrix and the original matrix reaches the preset threshold.</p>
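      <p>As an illustration, the sketch below shows how a Jacobi SVD with these termination settings can be invoked through cuSOLVER's gesvdj routine; the paper does not name the routine it uses, so the choice of cusolverDnSgesvdj here is an assumption.</p>
      <preformat><![CDATA[
// Minimal sketch of a Jacobi SVD through cuSOLVER's gesvdj routine (an
// assumption; the paper does not name its SVD routine). The tolerance
// (1e-8) and the maximum sweep count (100) follow the settings stated above.
#include <cuda_runtime.h>
#include <cusolverDn.h>

// d_A: m-by-n device matrix (column-major, lda = m); overwritten by gesvdj.
// d_S, d_U, d_V: device outputs for singular values and singular vectors.
void jacobiSVD(float* d_A, int m, int n, float* d_S, float* d_U, float* d_V) {
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    gesvdjInfo_t params;
    cusolverDnCreateGesvdjInfo(&params);
    cusolverDnXgesvdjSetTolerance(params, 1e-8);  // accuracy threshold
    cusolverDnXgesvdjSetMaxSweeps(params, 100);   // maximum iteration count

    int lwork = 0;
    cusolverDnSgesvdj_bufferSize(handle, CUSOLVER_EIG_MODE_VECTOR, 1 /*econ*/,
                                 m, n, d_A, m, d_S, d_U, m, d_V, n,
                                 &lwork, params);
    float* d_work = nullptr;
    int* d_info = nullptr;
    cudaMalloc(&d_work, sizeof(float) * lwork);
    cudaMalloc(&d_info, sizeof(int));

    // Economy-size SVD: A = U * diag(S) * V^T, via one-sided Jacobi rotations.
    cusolverDnSgesvdj(handle, CUSOLVER_EIG_MODE_VECTOR, 1 /*econ*/,
                      m, n, d_A, m, d_S, d_U, m, d_V, n,
                      d_work, lwork, d_info, params);

    cudaFree(d_work);
    cudaFree(d_info);
    cusolverDnDestroyGesvdjInfo(params);
    cusolverDnDestroy(handle);
}
      ]]></preformat>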
      <p>Diagonal Matrix and Matrix Multiplication in Parallel</p>
      <p>The multiplication of a diagonal matrix and a dense matrix is the operation with the second-longest running time; it appears in the eleventh line of Algorithm 1 and in the fifth and eleventh lines of Algorithm 2. The time cost of this operation increases rapidly with the dimension size of the data. We find that the traditional procedure introduces redundant calculations because values exist only on the diagonal. Therefore, we accelerate the computation using the following parallel method:</p>
      <p>S·Vᵀ = parallel(s_k ⋅ v_kᵀ),   (3)
where s_k and v_kᵀ represent the k-th diagonal value of the diagonal matrix S and the k-th row of the matrix Vᵀ, respectively. This parallel computation method takes advantage of parallelism and reduces redundant computations.</p>
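      <p>A minimal CUDA sketch of Eq. (3) is given below; it assumes a row-major m × n layout for Vᵀ and is illustrative rather than the paper's exact kernel.</p>
      <preformat><![CDATA[
// Each thread produces one element of S*V^T by scaling the corresponding
// element of V^T with the singular value of its row, so the diagonal matrix
// is never materialized and no multiplications by off-diagonal zeros occur.
__global__ void diagTimesMatrix(const float* s, const float* VT, float* out,
                                int m, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < m * n) {
        int row = idx / n;              // row index selects the singular value s_k
        out[idx] = s[row] * VT[idx];
    }
}

// Example launch: one thread per output element.
// diagTimesMatrix<<<(m * n + 255) / 256, 256>>>(d_s, d_VT, d_out, m, n);
      ]]></preformat>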
      <p>Element-wise Product in Parallel</p>
      <p>The sixth to ninth lines of Algorithm 1 and the fifth to seventh lines of Algorithm 2 are element-wise products, which can be calculated in parallel. We utilize the following parallel method to perform the element-wise product s ⋅ s, with m = #(s):
parallel(s^(o)_{m−k+1} = s_k ⋅ s_k),  1 ≤ k ≤ m,   (4)
where s^(o) denotes the output vector.</p>
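      <p>A corresponding CUDA sketch of Eq. (4) is shown below; the reversed output ordering follows the reconstruction of Eq. (4) above, and the kernel name is illustrative.</p>
      <preformat><![CDATA[
// Each thread squares one singular value and writes it to the mirrored
// position of the output vector, matching s^(o)_{m-k+1} = s_k * s_k.
__global__ void elementwiseSquareReversed(const float* s, float* sOut, int m) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < m) {
        sOut[m - 1 - k] = s[k] * s[k];
    }
}

// Example launch:
// elementwiseSquareReversed<<<(m + 255) / 256, 256>>>(d_s, d_sOut, m);
      ]]></preformat>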
    </sec>
    <sec id="sec-11">
      <title>3.2 Optimized Memory Access</title>
    </sec>
    <sec id="sec-12">
      <title>3.3 Efficient Data Transfer</title>
      <p>The input and output data volumes of the TT and TR decompositions increase quickly with the tensor size, which results in a high time cost for data transfer between CPUs and GPUs. To reduce this transfer overhead, the three cores 𝒢_1, 𝒢_2, 𝒢_3 are combined into a single array 𝒜. Meanwhile, [n1, n2, n3] stores the dimensions of the input tensor and [r0, r1, r2, r3] stores the TT-ranks or TR-ranks. The k-th core is then acquired through 𝒢_k = reshape(𝒜(∑_{j=1}^{k−1} r_{j−1} n_j r_j , ∑_{j=1}^{k} r_{j−1} n_j r_j), [r_{k−1}, n_k, r_k]), where 𝒜(a, b) denotes the elements of 𝒜 from position a to position b.</p>
    </sec>
    <sec id="sec-13">
      <title>4. Performance Evaluation</title>
      <p>Our experiments run on a server with 80 GB of host memory. The server is equipped with two Intel Xeon E5-2640 V4 CPUs; each CPU has ten cores supporting twenty hardware threads. Moreover, the server is equipped with one Tesla V100 GPU with 32 GB of device memory and 5,120 CUDA cores running at 1.53 GHz. We focus on the speedup of our experimental results: speedup = (CPU running time)/(GPU running time). The relative square error (RSE) is used to measure the error between the data before and after decomposition: RSE = ‖𝒳 − 𝒢^(1) ∘ 𝒢^(2) ∘ 𝒢^(3)‖_F / ‖𝒳‖_F. The experimental tensor data are obtained by tensor contractions of three small tensors. For the Jacobi SVD, under single precision, the accuracy ε is set to 10^−8 and the maximum number of iterations is set to 100.</p>
      <p>The speedups and running times of our optimized third-order TT decomposition are exhibited in Figure 4. The tensor sizes vary from 100 × 100 × 100 to 1,200 × 1,200 × 1,200. The CPU implementation is based on the MATLAB code of [9]. Because of the GPU memory size, the maximum tensor size that can be processed on the Tesla V100 GPU is 1,200 × 1,200 × 1,200. Compared with the CPU implementation, the optimized GPU implementation obtains 14.25× speedups on average and up to 24.80×, which is higher than the GPU baseline implementation. The RSEs of the CPU and GPU implementations are both on the order of 10^−4. In our experiments, the speedups of the optimized implementations show a generally upward trend.</p>
    </sec>
    <sec id="sec-14">
      <title>5. Conclusions</title>
      <p>High-performance third-order tensor-train and tensor-ring decomposition implementations on GPUs are proposed in this paper. To improve the efficiency of the algorithms, three optimization strategies are proposed. First, efficient memory access is proposed to reduce the memory footprint. Second, parallelization strategies are widely adopted in the algorithms to match the GPU architecture. Third, we use the highly parallel Jacobi SVD to reduce the time spent on the critical calculations.</p>
      <p>Moreover, we experimentally verify the advantages of our optimized decomposition algorithms. The third-order TT and TR decompositions achieve maximum speedups of 6.67× and 6.36×, respectively. Implementing multi-GPU versions of higher-order TT and TR decompositions is our future work. Meanwhile, the optimized third-order TT and TR decomposition algorithms will be integrated into the cuTensor library [6].</p>
    </sec>
    <sec id="sec-15">
      <title>6. References</title>
      <p>[1] Rettinger A, et al. Context-aware tensor decomposition for relation prediction in social networks[J]. Social Network Analysis and Mining, 2012, 2(4): 373-385.</p>
      <p>[2] Charlier J, Falk E, et al. User-device authentication in mobile banking using APHEN for PARATUCK2 tensor decomposition[C]//2018 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 2018: 886-894.</p>
      <p>[3] Huang L, Wu X, et al. Color demosaicking via nonlocal tensor representation[C]//2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017: 1812-1816.</p>
      <p>[4] Han X, Wu B, et al. Tensor FISTA-Net for real-time snapshot compressive imaging[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2020, 34(07): 10933-10940.</p>
      <p>[5] Ma J, Liu X Y, et al. Deep Tensor ADMM-Net for snapshot compressive imaging[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 10223-10232.</p>
      <p>[6] Zhang T, Liu X Y, et al. cuTensor-Tubal: Efficient primitives for tubal-rank tensor learning operations on GPUs[J]. IEEE Transactions on Parallel and Distributed Systems, 2019, 31(3): 595-610.</p>
      <p>[7] Roberts C, Milsted A, et al. TensorNetwork: A library for physics and machine learning[J]. arXiv preprint arXiv:1905.01330, 2019.</p>
      <p>[8] Drmač Z, Veselić K. New fast and accurate Jacobi SVD algorithm. I[J]. SIAM Journal on Matrix Analysis and Applications, 2008, 29(4): 1322-1342.</p>
      <p>[9] Oseledets I V. Tensor-train decomposition[J]. SIAM Journal on Scientific Computing, 2011, 33(5): 2295-2317.</p>
      <p>[10] Zhao Q, Zhou G, et al. Tensor ring decomposition[J]. arXiv preprint arXiv:1606.05535, 2016.</p>
      <p>[11] Mickelin O, Karaman S. On algorithms for and computing with the tensor ring decomposition[J]. Numerical Linear Algebra with Applications, 2020, 27(3): e2289.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>