<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">High Performance Third-order Tensor-Train and Tensor-Ring Decompositions on GPUs</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Hao</forename><surname>Hong</surname></persName>
							<email>honghao@shu.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="institution">Shanghai University</orgName>
								<address>
									<addrLine>No.99 Shangda Road BaoShan District</addrLine>
									<postCode>200444</postCode>
									<settlement>Shanghai</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Weiqin</forename><surname>Tong</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Shanghai University</orgName>
								<address>
									<addrLine>No.99 Shangda Road BaoShan District</addrLine>
									<postCode>200444</postCode>
									<settlement>Shanghai</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tao</forename><surname>Zhang</surname></persName>
							<email>taozhang@shu.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="institution">Shanghai University</orgName>
								<address>
									<addrLine>No.99 Shangda Road BaoShan District</addrLine>
									<postCode>200444</postCode>
									<settlement>Shanghai</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Xiaoyang</forename><surname>Liu</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Columbia University</orgName>
								<address>
									<addrLine>116th St &amp; Broadway</addrLine>
									<postCode>10027</postCode>
									<region>New York</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="laboratory">ISCIPT2022</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
<orgName type="department">International Conference on Computer and Information Processing Technology</orgName>
								<address>
									<addrLine>August 5-7</addrLine>
									<postCode>2022</postCode>
									<settlement>Shenyang</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">High Performance Third-order Tensor-Train and Tensor-Ring Decompositions on GPUs</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">51CE71BB7D6E23BD483D47A660A66FE7</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T18:59+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Tensor decomposition is an essential tool for analyzing data in many fields, such as sociology, financial encryption, and signal processing. According to the "Curse of Dimensionality," the time and space costs of tensor decomposition increase quickly with the tensor size. This paper proposes high-performance GPU-based tensor-train (TT) and tensor-ring (TR) decomposition implementations. Firstly, we replace the traditional singular value decomposition (SVD) with the highly parallel Jacobi-based SVD to match the GPU architecture. Secondly, we design a high-performance matrix multiplication on the GPU. Thirdly, by analyzing the data storage layout, we propose optimized memory access to reduce the memory footprint. Moreover, we conducted experiments on a V100 GPU to verify the performance of our algorithms. Our optimized GPU-based TT and TR decomposition implementations achieve maximum speedups of 6.67× and 6.36× over the basic implementations.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Tensor decomposition is an extension of matrix decomposition to higher dimensions. It is an essential tool in social relation prediction <ref type="bibr" target="#b0">[1]</ref>, financial encryption <ref type="bibr" target="#b1">[2]</ref>, and image processing <ref type="bibr" target="#b2">[3]</ref>. Among tensor decompositions, the tensor-train (TT) and tensor-ring (TR) decompositions have been widely used in signal processing <ref type="bibr" target="#b3">[4]</ref>, computer vision <ref type="bibr" target="#b4">[5]</ref>, and data mining <ref type="bibr" target="#b5">[6]</ref>. According to the "Curse of Dimensionality," the time and space costs of tensor decomposition increase quickly with the size and dimension of the tensor, so developing high-performance tensor decompositions is a critical task. Currently, CPU-based TT and TR decompositions <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b10">10]</ref> do not take full advantage of the algorithms' parallelism, making it difficult to process large amounts of data.</p><p>This paper utilizes GPUs to achieve high-performance TT and TR decomposition algorithms. Moreover, we conducted experiments comparing existing CPU algorithms with our optimized GPU algorithms, showing that our optimized TT and TR decomposition implementations on GPUs are efficient.</p><p>This paper makes three major contributions:</p><p>• This paper proposes high-performance GPU-based third-order TT and TR decompositions with the same accuracy as the CPU versions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>•</head><p>This paper proposes efficient memory access for tensors on GPUs and an efficient diagonal-matrix-by-matrix multiplication, and utilizes the highly parallel Jacobi-based SVD in place of the traditional SVD. The optimized algorithms reduce the memory footprint and avoid explicit tensor matricization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>•</head><p>This paper conducts experiments to verify the performance of the TT and TR decompositions on one Tesla V100 GPU. The optimized TT and TR decomposition implementations achieve maximum speedups of 6.67× and 6.36× over the basic GPU implementations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Tensor-Train and Tensor-Ring Decompositions</head><p>This section describes notations, TT decomposition, and TR decomposition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Operations and Notations</head><p>This paper utilizes boldface lowercase letters 𝐚 ∈ ℝ^{n}, boldface uppercase letters 𝑨 ∈ ℝ^{n_1×n_2}, and uppercase calligraphic letters 𝒜 ∈ ℝ^{n_1×n_2×n_3} to denote vectors, matrices, and tensors, respectively. Tensor contraction is represented with the ∘ symbol.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Tensor-Train Decomposition</head><p>Figure <ref type="figure" target="#fig_0">1</ref> shows that the TT decomposition <ref type="bibr" target="#b8">[9]</ref> uses three core tensors to express a third-order tensor 𝒜 ∈ ℝ^{n_1×n_2×n_3} by tensor contractions: 𝒜 = 𝒢^(1) ∘ 𝒢^(2) ∘ 𝒢^(3), (1) where 𝒢^(k) ∈ ℝ^{r_{k−1}×n_k×r_k} denotes the k-th core tensor and [r_0, r_1, r_2, r_3] denotes the TT-ranks with r_0 = r_3 = 1. Therefore, 𝒢^(1) and 𝒢^(3) are second-order tensors (matrices).</p><p>The tensor-train structure is one of the tensor networks and is represented in Figure <ref type="figure" target="#fig_0">1</ref> by graphical modeling <ref type="bibr" target="#b6">[7]</ref>. The connections between two tensors indicate tensor contractions. The original tensor 𝒜 is obtained by the tensor contractions of all the tensors on the left. The steps of the third-order TT decomposition <ref type="bibr" target="#b8">[9]</ref> are described in Algorithm 1. In the third and tenth lines of Algorithm 1, reshaping operations convert a tensor 𝒞^(k−1) into a matrix 𝐂 with r_{k−1}n_k rows and ∏_{i=k+1}^{3} n_i columns, and a matrix 𝑼 into a tensor 𝒢^(k) with r_{k−1} rows, n_k columns, and r_k in the third direction, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Tensor-Ring Decomposition</head><p>Figure <ref type="figure" target="#fig_1">2</ref> shows that the TR decomposition <ref type="bibr" target="#b10">[10]</ref> uses three third-order core tensors 𝒢^(k) ∈ ℝ^{r_{k−1}×n_k×r_k}, k = 1, 2, 3, to express a third-order tensor 𝒜 ∈ ℝ^{n_1×n_2×n_3} by tensor contractions: 𝒜 = 𝒢^(1) ∘ 𝒢^(2) ∘ 𝒢^(3), (2) where tensor contractions are calculated between adjacent tensors and also between 𝒢^(1) and 𝒢^(3). [r_0, r_1, r_2, r_3] denotes the TR-ranks. Because of the ring structure of the TR cores, r_0 = r_3 is not forced to equal 1, which distinguishes the TR structure from the TT structure. The TR structure is another special case of tensor networks, and the steps of the TR decomposition <ref type="bibr" target="#b10">[10]</ref> are described in Algorithm 2.</p></div>
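To make the decomposition concrete, here is a minimal NumPy sketch of a third-order TT decomposition in the spirit of Algorithm 1. It is a simplification under stated assumptions: fixed TT-ranks r_1, r_2 instead of error-based truncation, NumPy's built-in SVD instead of the Jacobi SVD used on the GPU, and NumPy's default row-major unfolding convention; the name `tt_svd_3d` is chosen for illustration and does not appear in the paper.

```python
import numpy as np

def tt_svd_3d(A, r1, r2):
    """Sketch of third-order TT decomposition with fixed ranks (r0 = r3 = 1)."""
    n1, n2, n3 = A.shape
    # Mode-1 unfolding: n1 x (n2*n3) matrix.
    C = A.reshape(n1, n2 * n3)
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    G1 = U[:, :r1]                                   # first core: n1 x r1
    # Carry S*V^T forward and refold it for the next SVD step.
    C = (np.diag(s[:r1]) @ Vt[:r1]).reshape(r1 * n2, n3)
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    G2 = U[:, :r2].reshape(r1, n2, r2)               # second core: r1 x n2 x r2
    G3 = np.diag(s[:r2]) @ Vt[:r2]                   # third core: r2 x n3
    return G1, G2, G3

# Reconstruction by tensor contractions: A ≈ G1 ∘ G2 ∘ G3.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 7, 8))
G1, G2, G3 = tt_svd_3d(A, 6, 8)   # full ranks -> exact reconstruction
A_hat = np.einsum('ia,ajb,bk->ijk', G1, G2, G3)
print(np.allclose(A, A_hat))
```

With truncated ranks r_1 < 6 or r_2 < 8 the same code returns the best low-TT-rank approximation produced by sequential SVD truncation, at the cost of a nonzero reconstruction error.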
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">High-performance Third-order Tensor-Train and Tensor-Ring Decompositions on GPUs</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Parallelization Schemes</head><p>In this section, we propose three parallel optimizations for TT decomposition in Algorithm 1 and TR decomposition in Algorithm 2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Jacobi SVD in Parallel</head><p>In TT and TR decompositions, the matrix SVD operations take the most time, up to 67% of the total running time. The traditional SVD operation does not match the GPU architecture because of its low parallelism. As a substitute, the Jacobi SVD <ref type="bibr" target="#b7">[8]</ref> is adopted to exploit the GPU's high parallelism. The Jacobi SVD needs a maximum iteration number and an accuracy threshold to determine when the algorithm terminates: it stops when the number of iterations reaches the maximum or when the error between the reconstructed matrix and the original matrix falls below the preset threshold. Under single precision, this paper sets the maximum number of iterations to 100 and the accuracy to 10^{-8}, which gave the minimum error in our experiments.</p></div>
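As a rough illustration of why the Jacobi approach parallelizes well, the following NumPy sketch implements a one-sided (Hestenes) Jacobi SVD with cyclic sweeps. This is a didactic CPU sketch, not the paper's GPU kernel: within one sweep the column-pair rotations are largely independent, which is what a GPU implementation (e.g. a batched `gesvdj`-style routine) exploits. The termination logic mirrors the text: stop after a maximum number of sweeps or when the off-diagonal measure drops below the accuracy threshold.

```python
import numpy as np

def one_sided_jacobi_svd(A, max_sweeps=100, eps=1e-8):
    """One-sided Jacobi SVD: orthogonalize the columns of A by plane rotations."""
    U = A.astype(float).copy()
    n = U.shape[1]
    V = np.eye(n)
    for _ in range(max_sweeps):
        off = 0.0                      # largest normalized off-diagonal this sweep
        for p in range(n - 1):
            for q in range(p + 1, n):  # pairs are independent -> GPU-friendly
                alpha = U[:, p] @ U[:, p]
                beta = U[:, q] @ U[:, q]
                gamma = U[:, p] @ U[:, q]
                off = max(off, abs(gamma) / np.sqrt(alpha * beta))
                if gamma == 0.0:
                    continue
                # Rotation angle that zeroes entry (p, q) of U^T U.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = np.copysign(1.0, zeta) / (abs(zeta) + np.hypot(1.0, zeta))
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = c * t
                J = np.array([[c, s], [-s, c]])
                U[:, [p, q]] = U[:, [p, q]] @ J
                V[:, [p, q]] = V[:, [p, q]] @ J
        if off < eps:                  # accuracy threshold reached
            break
    sigma = np.linalg.norm(U, axis=0)  # singular values (unsorted)
    return U / sigma, sigma, V.T

A = np.random.default_rng(1).standard_normal((8, 5))
U, s, Vt = one_sided_jacobi_svd(A)
print(np.allclose((U * s) @ Vt, A))
```

At convergence the columns of U are numerically orthogonal, and the factorization A = UΣV^T holds by construction at every step.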
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Diagonal Matrix and Matrix Multiplication in Parallel</head><p>Diagonal-matrix-by-matrix multiplication is the second most time-consuming operation, appearing in the eleventh line of Algorithm 1 and the fifth and eleventh lines of Algorithm 2. The time cost of this operation increases rapidly with the data size. We observe that the traditional procedure introduces redundant calculations, because nonzero values exist only on the diagonal. Therefore, we accelerate the computation with the following parallel method:</p><formula xml:id="formula_1">𝑺𝑽^T = parallel(s_k ⋅ 𝑽_k^T),<label>(3)</label></formula><p>where s_k and 𝑽_k^T represent the k-th diagonal value of matrix 𝑺 and the k-th row of matrix 𝑽^T, respectively. This method exploits parallelism and removes the redundant computations.</p></div>
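A small NumPy sketch of Eq. (3): instead of materializing the diagonal matrix and running a full matrix product, each output row is a single scalar-vector scaling, and all rows are independent (on a GPU, one thread block per row suffices). The variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
s = rng.standard_normal(4)           # diagonal entries of S (singular values)
Vt = rng.standard_normal((4, 6))     # V^T from the SVD

# Naive: build diag(S) explicitly and pay for a full GEMM full of zeros.
naive = np.diag(s) @ Vt

# Optimized (Eq. 3): row k of the result is just s[k] * Vt[k, :].
scaled = s[:, None] * Vt

print(np.allclose(naive, scaled))
```

The optimized form does m·n multiplications instead of m²·n, which is exactly the redundancy the text points out.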
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Element-wise Product in Parallel</head><p>The sixth to ninth lines of Algorithm 1 and the fifth to seventh lines of Algorithm 2 are element-wise products, which can be calculated in parallel. We utilize the following parallel method to perform the element-wise product s ⋅ s, with m = #(s):</p><formula xml:id="formula_2">parallel(s^(0)_{m−k+1} = s_k ⋅ s_k), 1 ≤ k ≤ m.<label>(4)</label></formula><p>Figure <ref type="figure">3</ref>: A schematic diagram of the tensor's layout in memory.</p></div>
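Eq. (4) is embarrassingly parallel: each squared entry depends on exactly one input entry, so a GPU can assign one thread per k (the reversed index in Eq. (4) only changes where each result is stored, not the computation). In NumPy the same operation is one vectorized expression:

```python
import numpy as np

s = np.array([4.0, 3.0, 2.0, 1.0])   # singular values from the SVD step
s_sq = s * s                          # all m = 4 products are independent
print(s_sq)                           # entry k holds s_k * s_k
```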
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Optimized Memory Access</head><p>Figure <ref type="figure">3</ref> exhibits a third-order tensor's column-major layout in memory: the tensor is stored frontal slice by frontal slice, each slice in column-major order. We adopt the column-major layout to meet the data-reading requirements of two libraries, cuSOLVER and cuBLAS. In addition, this memory layout directly yields the mode-1 unfolding of the tensor without explicit tensor matricization, reducing the overhead.</p><p>To reduce the memory footprint and the overhead of the truncation operations in Algorithm 1 and Algorithm 2, the leading truncated sub-sections of the matrix 𝑽^T and the vector s are computed directly in the eleventh line of Algorithm 1 and the fifth and eleventh lines of Algorithm 2. We utilize direct conversion in memory to avoid the overhead of the tensor permuting operations in the tenth and eleventh lines of Algorithm 2. Moreover, with this optimized memory access, the reshape operations in both algorithms are eliminated. These algorithms generate many intermediate variables, which introduces a large memory footprint and time overhead; therefore, we reuse allocated memory and dynamically delete and allocate intermediate variables in GPU memory to reduce memory consumption.</p></div>
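The "free" mode-1 unfolding can be checked in a few lines of NumPy. This sketch assumes a Fortran (column-major) buffer, matching the layout the paper adopts for cuSOLVER/cuBLAS; the unfolding is then a zero-copy reinterpretation of the same memory.

```python
import numpy as np

n1, n2, n3 = 3, 4, 5
A = np.arange(n1 * n2 * n3, dtype=float).reshape(n1, n2, n3)

# Column-major storage: frontal slices A[:, :, k] are laid out one after
# another, each slice itself stored column by column.
buf = np.asfortranarray(A)

# The mode-1 unfolding A_(1) (n1 x n2*n3) is a view of the same buffer --
# no data movement, which is the point of the optimization.
mode1 = buf.reshape(n1, n2 * n3, order='F')

# Reference: concatenate the frontal slices side by side.
ref = np.concatenate([A[:, :, k] for k in range(n3)], axis=1)
print(np.allclose(mode1, ref))
```

A row-major layout would instead give the mode-3 unfolding for free; the choice of column-major here is driven by the two CUDA libraries' conventions.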
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Efficient Data Transfer</head><p>The input and output data volumes of TT and TR decompositions increase quickly with the tensor size, which leads to high time consumption for data transfer between CPUs and GPUs. To reduce the transfer overhead, the cores 𝒢^(1), 𝒢^(2), 𝒢^(3) are combined into one array 𝐜. Meanwhile, [n_1, n_2, n_3] stores the dimensions of the input tensor and [r_0, r_1, r_2, r_3] stores the TT-ranks or TR-ranks. The k-th core is then recovered as 𝒢^(k) = reshape(𝐜(∑_{j=1}^{k−1} r_{j−1}n_jr_j + 1 : ∑_{j=1}^{k} r_{j−1}n_jr_j), [r_{k−1}, n_k, r_k]), where 𝐜(a : b) denotes the elements of 𝐜 from position a to position b.</p></div>
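The packing scheme can be sketched as follows in NumPy (0-based indexing, with prefix sums of the core sizes as offsets). The ranks and dimensions below are hypothetical example values, and `unpack` is an illustrative helper, not a function from the paper.

```python
import numpy as np

# Hypothetical third-order cores with ranks [r0, r1, r2, r3] = [1, 2, 3, 1].
r = [1, 2, 3, 1]
n = [4, 5, 6]
rng = np.random.default_rng(3)
cores = [rng.standard_normal((r[k], n[k], r[k + 1])) for k in range(3)]

# Pack all cores into one flat array c for a single CPU<->GPU transfer.
c = np.concatenate([g.ravel(order='F') for g in cores])

def unpack(c, r, n, k):
    """Recover the k-th core (0-based): its elements start right after
    the combined sizes of cores 0..k-1 in the packed array."""
    sizes = [r[j] * n[j] * r[j + 1] for j in range(3)]
    start = sum(sizes[:k])
    return c[start:start + sizes[k]].reshape(r[k], n[k], r[k + 1], order='F')

print(all(np.allclose(unpack(c, r, n, k), cores[k]) for k in range(3)))
```

One packed transfer plus two small metadata arrays replaces three separate transfers, which matters because each CPU↔GPU copy has a fixed launch latency on top of its bandwidth cost.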
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Performance Evaluation</head><p>Our experiments run on a server with 80 GB host memory. The server is equipped with two Intel Xeon E5-2640 v4 CPUs; each CPU has ten cores supporting twenty hardware threads. Moreover, the server is equipped with one Tesla V100 GPU with 32 GB device memory and 5,120 CUDA cores @1.53 GHz. We report the speedup of each experiment: speedup = (CPU running time)/(GPU running time). The relative square error (RSE) is utilized to measure the error of the data before and after decomposition: RSE = ||𝒜 − 𝒢^(1) ∘ 𝒢^(2) ∘ 𝒢^(3)||_F /||𝒜||_F. The experimental tensor data are obtained by tensor contractions of three small tensors. For the Jacobi SVD, under single precision, the accuracy ε is set to 10^{-8} and the maximum number of iterations is set to 100.</p><p>The speedups and running time of our optimized third-order TT decomposition are exhibited in Figure <ref type="figure" target="#fig_2">4</ref>. The tensor sizes vary from 100 × 100 × 100 to 1,200 × 1,200 × 1,200. The CPU implementation is the MATLAB code from <ref type="bibr" target="#b8">[9]</ref>. Because of the GPU memory size, the maximum tensor size that can be processed on the Tesla V100 GPU is 1,200 × 1,200 × 1,200. Compared with the CPU implementation, the optimized GPU implementation obtains speedups of 14.25× on average and up to 24.80×, which are higher than those of the GPU baseline implementation. The RSEs of the CPU and GPU are on the 10^{-4} level. In our experiments, the speedups of the optimized implementation show a general upward trend.</p><p>The speedups and running time of our optimized third-order TR decomposition are exhibited in Figure <ref type="figure" target="#fig_3">5</ref>. The CPU implementation is the MATLAB code from <ref type="bibr" target="#b11">[11]</ref>. Compared with the CPU implementation, our optimized GPU implementation achieves speedups of 11.35× on average and up to 21.77×, which are higher than those of the GPU baseline implementation. The RSEs of the CPU and GPU are also on the 10^{-4} level. Because of the overhead of iteration and data transfer, the speedup is less than one when the tensor size is 100 × 100 × 100. The speedups of the optimized TR decomposition keep increasing with the tensor size.</p></div>
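The two evaluation metrics are one-liners; a NumPy sketch of the RSE formula (the tensors below are synthetic stand-ins, not the paper's data):

```python
import numpy as np

def rse(A, A_hat):
    # RSE = ||A - A_hat||_F / ||A||_F; np.linalg.norm flattens an
    # N-dimensional array, which gives the Frobenius norm directly.
    return np.linalg.norm(A - A_hat) / np.linalg.norm(A)

A = np.ones((10, 10, 10))
err = rse(A, A + 1e-4)   # uniform perturbation of 1e-4 per entry
print(err)               # ~1e-4, i.e. the "10^-4 level" reported above
```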
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions</head><p>High-performance third-order tensor-train and tensor-ring decomposition implementations on GPUs are proposed in this paper. To improve the efficiency of the algorithms, three optimization strategies are proposed. First, efficient memory access is proposed to reduce the memory footprint. Second, parallelization strategies are widely adopted in the algorithms to match GPUs. Third, we use the highly parallel Jacobi SVD to reduce the time of critical calculations.</p><p>Moreover, we experimentally verify the advantages of our optimized decomposition algorithms. The third-order TT and TR decompositions achieve maximum speedups of 6.67× and 6.36×. Developing multi-GPU implementations of high-order TT and TR decompositions is our future work. Meanwhile, the optimized third-order TT and TR decomposition algorithms will be integrated into the cuTensor library <ref type="bibr" target="#b5">[6]</ref>.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The diagram of the third-order tensor-train decomposition.</figDesc><graphic coords="2,188.75,244.61,217.05,115.70" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The diagram of the third-order tensor-ring decomposition.</figDesc><graphic coords="2,188.75,579.02,217.10,116.95" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Speedups and running time of third-order TT decomposition on two Intel CPUs and a Tesla V100 GPU.</figDesc><graphic coords="5,189.20,215.27,216.53,163.10" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Speedups and running time of third-order TR decomposition on two Intel CPUs and a Tesla V100 GPU.</figDesc><graphic coords="6,188.00,72.00,217.80,163.35" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="3,191.12,332.35,212.67,260.93" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="4,180.56,131.00,233.92,294.20" type="bitmap" /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Context-aware tensor decomposition for relation prediction in social networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rettinger</surname></persName>
		</author>
	</analytic>
	<monogr>
<title level="j">Social Network Analysis and Mining</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="373" to="385" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">User-device authentication in mobile banking using APHEN for PARATUCK2 tensor decomposition</title>
		<author>
			<persName><forename type="first">J</forename><surname>Charlier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Falk</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Data Mining Workshops (ICDMW). IEEE</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">2018</biblScope>
			<biblScope unit="page" from="886" to="894" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Color demosaicking via nonlocal tensor representation</title>
		<author>
			<persName><forename type="first">L</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">2017</biblScope>
			<biblScope unit="page" from="1812" to="1816" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Tensor FISTA-Net for real-time snapshot compressive imaging</title>
		<author>
			<persName><forename type="first">X</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="10933" to="10940" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Deep tensor admm-net for snapshot compressive imaging</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X Y</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF International Conference on Computer Vision</title>
				<meeting>the IEEE/CVF International Conference on Computer Vision</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="10223" to="10232" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">cuTensor-Tubal: Efficient primitives for tubal-rank tensor learning operations on GPUs</title>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X Y</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Parallel and Distributed Systems</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="595" to="610" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Tensornetwork: A library for physics and machine learning</title>
		<author>
			<persName><forename type="first">C</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Milsted</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1905.01330</idno>
	</analytic>
	<monogr>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">New fast and accurate Jacobi SVD algorithm</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Drmač</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Veselić</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIAM Journal on matrix analysis and applications</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="1322" to="1342" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Tensor-train decomposition</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">V</forename><surname>Oseledets</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIAM Journal on Scientific Computing</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="2295" to="2317" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Tensor ring decomposition</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhou</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1606.05535</idno>
	</analytic>
	<monogr>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">On algorithms for and computing with the tensor ring decomposition</title>
		<author>
			<persName><forename type="first">O</forename><surname>Mickelin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Karaman</surname></persName>
		</author>
	</analytic>
	<monogr>
<title level="j">Numerical Linear Algebra with Applications</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page">e2289</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
