=Paper=
{{Paper
|id=Vol-1787/481-485-paper-83
|storemode=property
|title=Parallel implementations of image reconstruction algorithms for X-ray microtomography
|pdfUrl=https://ceur-ws.org/Vol-1787/481-485-paper-83.pdf
|volume=Vol-1787
|authors= Victoria Tokareva,Andrey Gridin,Danila Kozhevnikov,Oksana Streltsova,Maxim Zuev
}}
==Parallel implementations of image reconstruction algorithms for X-ray microtomography==
Parallel implementations of image reconstruction algorithms for X-ray microtomography V. A. Tokarevaa, A. O. Gridin, D. A. Kozhevnikov, O. I. Streltsova, M. I. Zuev Joint Institute for Nuclear Research, 6, Joliot-Curie st., Dubna, Moscow region, 141980, Russia E-mail: a tokareva@jinr.ru Computer tomography is a widely used method for studying organisms or materials. In handling CT data horizontal slices of studied sample are reconstructed using its X-ray shadow projections at different angles. In modern tomographs a full object scan can be obtained with high resolution during a short time period. Significant improvement of detector resolution and, consequently, rapid growth of acquired data amounts for modern tomographic systems, demands development of more efficient image reconstruction software. The X-ray microCT scanner MARS (Medipix All Resolution System), being run at the Dzhelepov Laboratory of Nuclear Problems of Joint Institute for Nuclear Research, was proposed for 3D spectroscopic sample examination. It uses a scheme with cone- beam X-ray. Slice reconstruction is performed using an FDK (Feldkamp, Davis, and Kress, 1984) method, an approximate algorithm for circular cone-beam microtomography. Its simplicity makes it ubiquitously spread in CT. The FDK algorithm realization developed at JINR for working with MARS, currently used for image reconstruction, requires significant time to process the data due to its computation complexity. A priority task is to reduce it without loss of image quality. For this purpose, parallel implementations of the reconstruction algorithm using OpenMP technology have been developed and deployed for calculations on heterogeneous computing systems. Architecture based on task management (OpenMP tasks mechanism) was developed for optimal job sharing between threads and available devices (multi-core CPUs and Intel Xeon Phi co-processors). A comparative analysis of the developed parallel implementations has been performed, the results on calculation speedup and efficiency are presented. We also compare our realizations with the commercial ones. Optimal memory usage and employing parallel computing technologies accelerate the computations up to 34 times and make our realizations comparable to commercial analogs. The computations were performed on the heterogeneous cluster HybriLIT at the JINR Laboratory of Information Technologies. Keywords: circular cone-beam CT, fast algorimth, high perfomance computing, OpenMP, Xeon Phi c 2016 Victoria A. Tokareva, Andrey O. Gridin, Danila A. Kozhevnikov, Oksana I. Streltsova, Maxim I. Zuev 481 Introduction Medipix semiconductor detectors [Ballabriga, 2011] based on GaAs sensors are being studied at theLaboratory of Nuclear Problems JINR. The MARS (Medipix AllResolution System) [Gongadze, 2015] microCT-scanner has been obtained for developing modern tomographicmethods using Medipixdetectors. It is a CT scanner incorporating a new generation X-ray detector based on Medipix chip and GaAs:Cr sensor. In modern tomography reconstruction computationally intensive algorithms are used, requesting high efficiency of their software implementations. The reconstruction program written at LNP works much slower than commercial analogs. The aim of this work is to realize the reconstruction algorithm with small computation time using different parallel technologies such as OpenMP and Xeon Phi. Tomography Tomography means a slice-based study of the structure of various objects. Several types of tomography exist, like X-ray, cathode ray, magnetic resonance, positron emission, ultrasonic, optical coherent tomography etc. Among all the methods, the medical X-ray tomography, historically named сomputed tomography (CT), has achieved the greatest success. CT is a technique for reconstructing slices of an object using X-ray measurements taken from different angles around the sample. A cone-beam tomography, when a two-dimensional detector is used for taking shadow projections of a sample, and X-rays form a cone with its base on the detector and its apex on the source (figure 1a), is an effective way to increase the scanning speed and use the rays otherwise removed by collimation. The common task can be formulated this way. The X-ray source irradiates the object from different sides, and the object can scatter or absorbs X-ray of a given energy. As the result we have sum information accomulated from the detector for each object’s projection, which can be accociated with the real-valued attenuation coefficient’s function, defined on R2 (or R3 ) and varied from point to point within the object. So, mathematical task is to obtain attenuation cofficients in every point of the object. Methods most widely used in various applications and important in CT development can be split into two classes: analitical and iterative. Analitical methods are based on precise mathematical solutions of equations for image-reconstruction. Most of them use Fourier and Radon transforms. Iterative methods are approximating the sample by the array of equal-density voxels, that are unknown variables, connected with a system of linear equations with projections as free terms. The systems are solved using iterative methods, giving the name for the class. MARS microCT-scanner MARS CT scanner (figure 1b) records a series of two-dimensional shadow projections taken around axis of rotation. The gantry (x-ray source and camera) with the scanning equipment rotated around the scanned sample. The gantry rotation axis is horizontal. A test sample (up to 100 mm in diameter and 300 mm length) placed in the center and can be moved along the rotation axis. Medipix3 electronic (figure 1c) allows to count the photon number above the threshold for two different thresholds in each pixel simultaneously. GaAs sensors register high-energy photons with better efficiency than silicon sensors. One of the advantages of the photon counting scheme is near absence of dark currents and descrease of the sample radiation dose. Besides, discrimination based on signal amplitude allows to select various photon energy ranges for spectrometry. 482 (a) Cone-beam tomography (b) MARS interior (c) Medipix-based pixel detector Fig. 1. MARS microCT-scanner FDK algorithm The FDK [Feldkamp, 1984] method, used in the current version of the software, is an approximatereconstruction algorithm for circular cone-beam tomography. The simplicity of the FDK methodhas made it the most used algorithm for cone-beam reconstruction. The algorithm consists of 3 steps: pre-weighting, backprojection and summation. The image reconstruction formula is − x sin β + y cos β Z 2π R2 R fFDK (x, y, z) = p̃F (β, R ,z )dβ, (1) 0 (R + x cos β + y sin β) 2 R + x cos β + y sin β R + x cos β + y sin β where ! R p̃ (β, a, b) = √ F p (β, a, b) ∗ gP (a) F (2) R2 + a2 + b2 is a convolution of the input data pF (β, a, b) with ramp-filter; a, b, R, β are geometrical parameters; z is a number of reconstructed slice; x, y are coordinates of the reconstructed point. The advantages of the FDK method are relative realization simplicity and fast convergence. One of its disadvantages is that only the Its reducing without loss of image quality is a priority task.central slice is reconstructed exactly, while other ones are biased, which is important in medical applications. It also requires more input data than iterative methods, which results in more irradiation for the patient. The result of scanning is a set of shadow projections obtained for different angles. The input data after preprocessing is given to reconstruction program as a set of filtered synograms. The program reconstruts the slices of the scanned object. Parallel realization features OpenMP was the main parallel computing technology employed for this realization. Besides, offloading the calculations to Intel Xeon Phi co-processors was used for speeding up the image reconstruction. Since each pixel of the image being reconstructed is represented by a single-precision floating-point number, vectorization [Using ..., 2016] becomes a key point in speeding up the calculationson modern processors and espessially co-processors: using this technology a CPU can performsimultaneously up to four operations and an Intel Xeon Phi co-processor up to 64 ones. The task management based on OpenMP tasks mechanism is designed as follows. CPU and all available co-processors are polled in cycle for being occupied by reconstructing a slice. If 483 any device is found to finish slice reconstruction, saving this slice is issued to a separate single- threaded CPU task and the device (either the co-processor or non-busy CPU threads) is occupied by reconstructing next slice. Comparative analysis of realization’s quality was performed using next quality indicators: 1) calculation time vs. the number of threads/processes; 2) speedup T 1/ T n , where T 1 is calculation time for one thread and T n is calculation time for n threads; 3) efficiency T 1/(nT n) where n is the number of threads or processes, and T 1 and T n are defined above. We also compare our realization to the commercial ones [Octopus, 2016], that we used for imagereconstruction as well, using two main criteria: memory usage and computation time. The realization was tested on the cluster node with two multi-core Intel Xeon E5-2695 v2 CPUs, each of them has 12 physical cores with 2 virtual cores for each one. Thus the maximal number of threads reached 48. Figure 2a shows that multithreading allows to decrease the the computation time significantly, and efficiency decreases with increasing number of threads. Nevertheless we cannot define the plateau, meaning that there is no such number of threads that its further inscrease doesn’t speed up the calculation. The acceleration achieved reaches 16 (figure 2b), a small fluctuation near 24 threads is explained by the hyperthreading, when threads above 24th are assigned to already occupied physical cores. The developed realization shows good efficiency (figure 2c) above 60 percents for any number of threads. Time [min] Speedup Efficiency 1 16 70 14 0.8 60 12 50 10 0.6 40 8 30 0.4 6 20 4 0.2 10 2 0 0 0 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 #threads #threads #threads (a) Computation time (b) Efficiency (c) Speedup Fig. 2. Quality indicators of OpenMP realization for multicore CPU vs. the number of threads. On the figure 3 the results for running Xeon Phi offload mode realization are shown. The realization was tested using Intel Xeon Phi 7120P co-processor, which has 61 physical cores with 4 virtual cores for each one. As can be seen from figure 3a, Phi version works significantly slower than multicore CPU version for a small number of threads, because of less power of co-processor cores compared to CPU ones. The plateau is achived at 140 threads, and then speeding up almost stops, and efficiency (figure 3c) decreases critically. So we can consider that the optimal number of threads is less than 140. The maximum speed-up we have for this version is up to 37 times (figure 3b), and the efficiency for the optimal thread number is more than 50 percent. Time [min] Speedup Efficiency 250 40 1 35 200 0.8 30 150 25 0.6 20 100 15 0.4 10 50 0.2 5 0 0 0 0 50 100 150 200 250 0 50 100 150 200 250 0 50 100 150 200 250 #threads #threads #threads (a) Computation time (b) Efficiency (c) Speedup Fig. 3. Quality indicators of OpenMP realization for Xeon Phi co-processor vs. the number of threads. 484 The histogram 4a shows that the minimal execution time for the parallel version is much less than for the single-threaded one, and the Phi version has the best performance. Meanwhile our realizations show worse times compared to the commercial ones despite using more computing power. Concerning memory usage, our algorithms tend to outperform proprietary software (figure 4b). (a) Processing time (b) Memory usage Fig. 4. Comparing preformance for different realizations of image reconstruction algorithms Results and outlook The multi-threaded realization of FDK algorithm based on the OpenMP parallelization technology and using Intel Xeon Phi co-processors for calculation offloading was implemented. The dependence of its efficiency from the number of threads on physical and logical processor cores was received. It was found that parallel algorithms allow to reduce the time of calculations by the factor of 29 using the multicore component of the cluster. The efficiency of the OpenMP implementation is more than 60 percent for 48 CPU threads and more than 70 percent for 240 Xeon Phi threads. Future plans include implementing parallel iterative algorithms to improve the quality of reconstructed cross sections and developing hierarchical algorithms to speed up the reconstruction. Acknowledgments Authors would like to express thier gratitude to HybriLIT heterogeneous cluster team for providing access to the cluster and for knowledge sharing and to students Maxim Lapin and Anvar Khakimov for studing optimal compilation flags and help with software documentation. References Ballabriga R., Campbell M., Heijne E., Llopart X., Tlustos L., Wong W. Medipix3: A 64 k pixel detector readout chip working in single photon counting mode with improved spectrometric performance // Nucl. Instrum. Methods Phys. Res. A. — 2011. — Vol. 633. — P. S15–S18. Gongadze A. et al. Alignment and resolution studies of a MARS CT scanner // Physics of Particles and Nuclei Letters. — 2015. — Vol. 12. — P. 725–735 Feldkamp L. A., Davis L. C., and Kress J. W. Practical cone-beam algorithm // Journal of the Optical Society of America A. — 1984. — Vol. 1. — P. 612–619 Using Automatic Vectorization, Intel [Electronic resource]: https://software.intel.com/en- us/node/522572 (accessed 31.10.2016) Octopus Imaging Software [Electronic resource]: https://octopusimaging.eu/ (accessed 31.10.2016) 485