=Paper=
{{Paper
|id=Vol-1787/481-485-paper-83
|storemode=property
|title=Parallel implementations of image reconstruction algorithms for X-ray microtomography
|pdfUrl=https://ceur-ws.org/Vol-1787/481-485-paper-83.pdf
|volume=Vol-1787
|authors= Victoria Tokareva,Andrey Gridin,Danila Kozhevnikov,Oksana Streltsova,Maxim Zuev
}}
==Parallel implementations of image reconstruction algorithms for X-ray microtomography==
<pdf width="1500px">https://ceur-ws.org/Vol-1787/481-485-paper-83.pdf</pdf>
<pre>
Parallel implementations of image reconstruction algorithms for
                   X-ray microtomography
      V. A. Tokarevaa, A. O. Gridin, D. A. Kozhevnikov, O. I. Streltsova,
                                  M. I. Zuev
                                  Joint Institute for Nuclear Research,
                       6, Joliot-Curie st., Dubna, Moscow region, 141980, Russia

                                         E-mail: a tokareva@jinr.ru


     Computer tomography is a widely used method for studying organisms or materials. In
handling CT data horizontal slices of studied sample are reconstructed using its X-ray shadow
projections at different angles. In modern tomographs a full object scan can be obtained with
high resolution during a short time period.
     Significant improvement of detector resolution and, consequently, rapid growth of acquired
data amounts for modern tomographic systems, demands development of more efficient image
reconstruction software. The X-ray microCT scanner MARS (Medipix All Resolution System),
being run at the Dzhelepov Laboratory of Nuclear Problems of Joint Institute for Nuclear
Research, was proposed for 3D spectroscopic sample examination. It uses a scheme with cone-
beam X-ray.
     Slice reconstruction is performed using an FDK (Feldkamp, Davis, and Kress, 1984) method,
an approximate algorithm for circular cone-beam microtomography. Its simplicity makes it
ubiquitously spread in CT.
     The FDK algorithm realization developed at JINR for working with MARS, currently used
for image reconstruction, requires significant time to process the data due to its computation
complexity. A priority task is to reduce it without loss of image quality. For this purpose, parallel
implementations of the reconstruction algorithm using OpenMP technology have been developed
and deployed for calculations on heterogeneous computing systems. Architecture based on task
management (OpenMP tasks mechanism) was developed for optimal job sharing between threads
and available devices (multi-core CPUs and Intel Xeon Phi co-processors).
     A comparative analysis of the developed parallel implementations has been performed, the
results on calculation speedup and efficiency are presented. We also compare our realizations
with the commercial ones. Optimal memory usage and employing parallel computing technologies
accelerate the computations up to 34 times and make our realizations comparable to commercial
analogs.
     The computations were performed on the heterogeneous cluster HybriLIT at the JINR
Laboratory of Information Technologies.
    Keywords: circular cone-beam CT, fast algorimth, high perfomance computing, OpenMP,
Xeon Phi


       c 2016 Victoria A. Tokareva, Andrey O. Gridin, Danila A. Kozhevnikov, Oksana I. Streltsova, Maxim I. Zuev


                                                                                                            481
Introduction
      Medipix semiconductor detectors [Ballabriga, 2011] based on GaAs sensors are being
studied at theLaboratory of Nuclear Problems JINR. The MARS (Medipix AllResolution
System) [Gongadze, 2015] microCT-scanner has been obtained for developing modern
tomographicmethods using Medipixdetectors.
      It is a CT scanner incorporating a new generation X-ray detector based on Medipix chip
and GaAs:Cr sensor.
      In modern tomography reconstruction computationally intensive algorithms are used,
requesting high efficiency of their software implementations. The reconstruction program written
at LNP works much slower than commercial analogs. The aim of this work is to realize the
reconstruction algorithm with small computation time using different parallel technologies such
as OpenMP and Xeon Phi.

Tomography
       Tomography means a slice-based study of the structure of various objects. Several types of
tomography exist, like X-ray, cathode ray, magnetic resonance, positron emission, ultrasonic,
optical coherent tomography etc. Among all the methods, the medical X-ray tomography,
historically named сomputed tomography (CT), has achieved the greatest success.
       CT is a technique for reconstructing slices of an object using X-ray measurements taken
from different angles around the sample. A cone-beam tomography, when a two-dimensional
detector is used for taking shadow projections of a sample, and X-rays form a cone with its base
on the detector and its apex on the source (figure 1a), is an effective way to increase the scanning
speed and use the rays otherwise removed by collimation.
       The common task can be formulated this way. The X-ray source irradiates the object from
different sides, and the object can scatter or absorbs X-ray of a given energy. As the result we
have sum information accomulated from the detector for each object’s projection, which can
be accociated with the real-valued attenuation coefficient’s function, defined on R2 (or R3 ) and
varied from point to point within the object. So, mathematical task is to obtain attenuation
cofficients in every point of the object.
       Methods most widely used in various applications and important in CT development
can be split into two classes: analitical and iterative. Analitical methods are based on precise
mathematical solutions of equations for image-reconstruction. Most of them use Fourier and
Radon transforms. Iterative methods are approximating the sample by the array of equal-density
voxels, that are unknown variables, connected with a system of linear equations with projections
as free terms. The systems are solved using iterative methods, giving the name for the class.

MARS microCT-scanner
      MARS CT scanner (figure 1b) records a series of two-dimensional shadow projections taken
around axis of rotation. The gantry (x-ray source and camera) with the scanning equipment
rotated around the scanned sample. The gantry rotation axis is horizontal. A test sample (up
to 100 mm in diameter and 300 mm length) placed in the center and can be moved along the
rotation axis.
      Medipix3 electronic (figure 1c) allows to count the photon number above the threshold for
two different thresholds in each pixel simultaneously. GaAs sensors register high-energy photons
with better efficiency than silicon sensors. One of the advantages of the photon counting scheme is
near absence of dark currents and descrease of the sample radiation dose. Besides, discrimination
based on signal amplitude allows to select various photon energy ranges for spectrometry.


                                                                                              482
     (a) Cone-beam tomography                 (b) MARS interior             (c) Medipix-based pixel detector

                                      Fig. 1. MARS microCT-scanner


FDK algorithm

      The FDK [Feldkamp, 1984] method, used in the current version of the software, is
an approximatereconstruction algorithm for circular cone-beam tomography. The simplicity of
the FDK methodhas made it the most used algorithm for cone-beam reconstruction.
      The algorithm consists of 3 steps: pre-weighting, backprojection and summation. The image
reconstruction formula is

                                                             − x sin β + y cos β
                   Z 2π
                                  R2                                                       R
  fFDK (x, y, z) =                                p̃F (β, R                      ,z                     )dβ, (1)
                    0   (R + x cos β + y sin β) 2           R + x cos β + y sin β R + x cos β + y sin β
where                                                                   !
                                                    R
                              p̃ (β, a, b) = √
                               F
                                                              p (β, a, b) ∗ gP (a)
                                                               F
                                                                                                            (2)
                                               R2 + a2 + b2
is a convolution of the input data pF (β, a, b) with ramp-filter; a, b, R, β are geometrical parameters;
z is a number of reconstructed slice; x, y are coordinates of the reconstructed point.
       The advantages of the FDK method are relative realization simplicity and fast convergence.
One of its disadvantages is that only the Its reducing without loss of image quality is a priority
task.central slice is reconstructed exactly, while other ones are biased, which is important in
medical applications. It also requires more input data than iterative methods, which results in
more irradiation for the patient.
       The result of scanning is a set of shadow projections obtained for different angles. The
input data after preprocessing is given to reconstruction program as a set of filtered synograms.
The program reconstruts the slices of the scanned object.

Parallel realization features
       OpenMP was the main parallel computing technology employed for this realization.
Besides, offloading the calculations to Intel Xeon Phi co-processors was used for speeding up
the image reconstruction.
       Since each pixel of the image being reconstructed is represented by a single-precision
floating-point number, vectorization [Using ..., 2016] becomes a key point in speeding up the
calculationson modern processors and espessially co-processors: using this technology a CPU
can performsimultaneously up to four operations and an Intel Xeon Phi co-processor up to 64
ones.
       The task management based on OpenMP tasks mechanism is designed as follows. CPU
and all available co-processors are polled in cycle for being occupied by reconstructing a slice. If


                                                                                                          483
any device is found to finish slice reconstruction, saving this slice is issued to a separate single-
threaded CPU task and the device (either the co-processor or non-busy CPU threads) is occupied
by reconstructing next slice.
       Comparative analysis of realization’s quality was performed using next quality indicators:
1) calculation time vs. the number of threads/processes; 2) speedup T 1/ T n , where T 1 is calculation
time for one thread and T n is calculation time for n threads; 3) efficiency T 1/(nT n) where n is the
number of threads or processes, and T 1 and T n are defined above.
       We also compare our realization to the commercial ones [Octopus, 2016], that we
used for imagereconstruction as well, using two main criteria: memory usage and computation
time.
       The realization was tested on the cluster node with two multi-core Intel Xeon E5-2695 v2
CPUs, each of them has 12 physical cores with 2 virtual cores for each one. Thus the maximal
number of threads reached 48.
       Figure 2a shows that multithreading allows to decrease the the computation time
significantly, and efficiency decreases with increasing number of threads. Nevertheless we cannot
define the plateau, meaning that there is no such number of threads that its further inscrease
doesn’t speed up the calculation. The acceleration achieved reaches 16 (figure 2b), a small
fluctuation near 24 threads is explained by the hyperthreading, when threads above 24th are
assigned to already occupied physical cores. The developed realization shows good efficiency
(figure 2c) above 60 percents for any number of threads.
    Time [min]


                                                       Speedup


                                                                                                    Efficiency
                                                                                                                   1
                                                                 16
                 70
                                                                 14                                              0.8
                 60
                                                                 12
                 50
                                                                 10                                              0.6
                 40
                                                                 8
                 30                                                                                              0.4
                                                                 6
                 20                                              4
                                                                                                                 0.2
                 10                                              2
                 0                                               0                                                0
                  0      10   20    30    40     50               0   10   20    30    40     50                   0   10   20    30    40     50
                                            #threads                                     #threads                                         #threads

                      (a) Computation time                            (b) Efficiency                                    (c) Speedup

    Fig. 2. Quality indicators of OpenMP realization for multicore CPU vs. the number of threads.

       On the figure 3 the results for running Xeon Phi offload mode realization are shown. The
realization was tested using Intel Xeon Phi 7120P co-processor, which has 61 physical cores with
4 virtual cores for each one. As can be seen from figure 3a, Phi version works significantly slower
than multicore CPU version for a small number of threads, because of less power of co-processor
cores compared to CPU ones. The plateau is achived at 140 threads, and then speeding up almost
stops, and efficiency (figure 3c) decreases critically. So we can consider that the optimal number
of threads is less than 140. The maximum speed-up we have for this version is up to 37 times
(figure 3b), and the efficiency for the optimal thread number is more than 50 percent.
    Time [min]


                                                       Speedup


                                                                                                    Efficiency


             250                                                 40                                                1
                                                                 35
             200                                                                                                 0.8
                                                                 30
             150                                                 25
                                                                                                                 0.6
                                                                 20
             100                                                 15                                              0.4
                                                                 10
                 50                                                                                              0.2
                                                                 5
                 0                                               0                                                0
                  0      50   100   150   200   250               0   50   100   150   200   250                   0   50   100   150   200   250
                                            #threads                                     #threads                                         #threads

                      (a) Computation time                            (b) Efficiency                                    (c) Speedup

Fig. 3. Quality indicators of OpenMP realization for Xeon Phi co-processor vs. the number of threads.


                                                                                                                                                    484
         The histogram 4a shows that the minimal execution time for the parallel version is much
  less than for the single-threaded one, and the Phi version has the best performance. Meanwhile
  our realizations show worse times compared to the commercial ones despite using more computing
  power. Concerning memory usage, our algorithms tend to outperform proprietary software
  (figure 4b).


                         (a) Processing time                  (b) Memory usage

       Fig. 4. Comparing preformance for different realizations of image reconstruction algorithms


  Results and outlook
        The multi-threaded realization of FDK algorithm based on the OpenMP parallelization
  technology and using Intel Xeon Phi co-processors for calculation offloading was implemented.
  The dependence of its efficiency from the number of threads on physical and logical processor
  cores was received. It was found that parallel algorithms allow to reduce the time of calculations
  by the factor of 29 using the multicore component of the cluster. The efficiency of the OpenMP
  implementation is more than 60 percent for 48 CPU threads and more than 70 percent for 240
  Xeon Phi threads.
        Future plans include implementing parallel iterative algorithms to improve the quality
  of reconstructed cross sections and developing hierarchical algorithms to speed up the
  reconstruction.

  Acknowledgments
        Authors would like to express thier gratitude to HybriLIT heterogeneous cluster team for
  providing access to the cluster and for knowledge sharing and to students Maxim Lapin and
  Anvar Khakimov for studing optimal compilation flags and help with software documentation.

  References
Ballabriga R., Campbell M., Heijne E., Llopart X., Tlustos L., Wong W. Medipix3: A 64 k pixel
       detector readout chip working in single photon counting mode with improved spectrometric
       performance // Nucl. Instrum. Methods Phys. Res. A. — 2011. — Vol. 633. — P. S15–S18.
Gongadze A. et al. Alignment and resolution studies of a MARS CT scanner // Physics of
       Particles and Nuclei Letters. — 2015. — Vol. 12. — P. 725–735
Feldkamp L. A., Davis L. C., and Kress J. W. Practical cone-beam algorithm // Journal of the
       Optical Society of America A. — 1984. — Vol. 1. — P. 612–619
Using Automatic Vectorization, Intel [Electronic resource]: https://software.intel.com/en-
       us/node/522572 (accessed 31.10.2016)
Octopus Imaging Software [Electronic resource]: https://octopusimaging.eu/ (accessed
       31.10.2016)


                                                                                                     485

</pre>