1. Introduction

R. Vladyka)

Development and analysis of a parallel method for detecting chromosomal translocations and inversions in DNA sequences⋆

Lesia Mochurad

Renata Vladyka

0 0 Lviv Polytechnic National University , 12 Bandera street, Lviv, 79013 , Ukraine

2023

000 0 0002

In the current context of the growing volume of genetic research and the need for the fast and accurate detection of chromosomal abnormalities, developing efficient algorithms that provide high performance, scalability, and accuracy in comparing DNA sequences is a critical task. In this study, the aim was to create and analyze a parallel algorithm for detecting chromosomal translocations and inversions in DNA sequences. For this purpose, we implemented an algorithm capable of comparing DNA sequences to detect genetic abnormalities, including translocations and inversions. The use of parallel computing made it possible to significantly improve the efficiency of the analysis, reducing the algorithm's execution time and increasing scalability. Particular attention was paid to how the algorithm uses multithreading and how it efficiently distributes the load between threads. To detect inversions, an algorithm was developed that compares two DNA sequences and detects possible changes in nucleotide sequences. To find translocations, the Needleman-Wunsch algorithm was applied, which uses parallel computing to optimally align genetic fragments. The results of the algorithms have shown high efficiency in detecting chromosomal mutations, which is important for genetic research and can be used for medical diagnosis.

eol>Translocation inversion DNA sequence Needleman-Wunsch algorithm nucleotides

1. Introduction

Thus, the diagnosis and prediction of chromosomal inversions and translocations are critical in genetic research and medical practice. Detection of these changes contributes to early diagnosis, which allows timely treatment and monitoring of patients. Predicting the risks of developing certain diseases based on the identified mutations also helps patients to realize their genetic risks and opportunities for prevention. To solve this problem, the Needleman-Wunsch algorithm is used [ 2 ]. However, one of the main problems with this algorithm is its high computational complexity; this algorithm has quadratic complexity. The latter makes it difficult to process large DNA sequences. The development of a parallel algorithm can reduce the execution time by simultaneously processing parts of the data [ 3 ], [ 4 ], [ 5 ]. Ensuring efficient parallelization requires balanced load distribution among threads to prevent idle time, which may involve optimizing data structures and algorithms. Proper synchronization is also essential to avoid conflicts when accessing shared resources, especially in algorithms with interdependent results. Maintaining result accuracy during parallel processing may require extra validation. Lastly, scalability is a key factor. It should be able to scale across different hardware platforms, from personal computers to supercomputer clusters, to achieve optimal results in different environments [6], [7].

Solving all the problems described above is an important step to improve the speed and accuracy of chromosomal abnormalities diagnosis, which, in turn, will increase the efficiency of genetic research and clinical practice.

In the field of genetic research, there are many important aspects that relate to the study of genetic mutations, their diagnosis, and their impact on human health. Various articles offer valuable insights that emphasize the importance of such mutations in medicine, evolution, and biology. For example, paper [8] discusses the work of the Human Gene Mutation Database (HGMD), which collects and analyzes information about human genetic mutations. The authors describe the functions of the HGMD, its use in clinical diagnostics and research, as well as plans to expand the database, which include the integration of GTEx project data and the introduction of automated tools for mutation prediction. The advantages of the article are a detailed overview of the HGMD functions and an emphasis on its role in clinical practice, but the disadvantages are the complexity of automatic identification of new mutations and the lack of examples of real-world use of the HGMD in clinical cases.

In [9], the authors analyze different types of genetic mutations and their impact on human health, including the occurrence of diseases due to mutations in genes or chromosomes. The advantages of this article are a clear description of the causes and consequences of genetic diseases, as well as a detailed overview of the types of mutations. However, there are also disadvantages, such as general statements that are not always supported by specific data, as well as a lack of details about research methods. Nevertheless, the article is useful for familiarizing oneself with genetic mutations and their impact on the human body.

Paper [10] investigates the use of biodosimetry to assess radiation doses in US military personnel who participated in nuclear tests after World War II. The main goal of this study is to assess chromosomal aberrations, such as inversions and translocations, as a method of retrospective biodosimetry. The results show that inversions can be an effective way to establish radiation doses received decades ago and that their combination with translocations improves the accuracy of the estimate. The study also indicates the influence of age and smoking status on the frequency of aberrations, which is higher in older individuals and smokers.

The PETI method is discussed in [11]. It effectively induces precise DNA recombinations in human cells. PETI successfully generates recombination and inversion mutations in endogenous genomes and can correct disease-related inversions and translocations, making it promising for disease modeling and therapeutic approaches. This method is characterized by high accuracy and flexibility compared to other genome editing methods such as Prime-Del and twinPE, but it is still limited in its use for episomal mutations and requires further research to confirm its results.

The authors of [12] investigate the effectiveness of the third generation of sequencing analysis for detecting breakpoints in unbalanced chromosomal translocations. The use of Nanopore long reads allows for the detection of microdeletions, microinsertions, and other structural changes that complement translocations. This approach provides accurate genetic information necessary for diagnosis and treatment. However, the high cost and possible experimental failures point to the need for further research and optimization of methods.

The paper [13] discusses the role of chromosomal rearrangements in the genetic differentiation and evolution of populations, in particular, using the example of the spiny frog with a chromosomal translocation polymorphism. The authors use whole-chromosome staining (WCP) and genetic analysis to confirm the common origin of the translocations. They found that translocated chromosomes have a higher level of genetic differentiation due to recombination suppression, which contributes to genetic diversity and population differentiation. The article also discusses the mechanisms of recombination suppression, such as the accumulation of repetitive sequences and the capture of adapted alleles.

This review is rounded off by article [14], which analyzes the importance of chromosomal inversions in the plant kingdom and their role in plant evolution. The authors note that inversions are common in many groups of plants and are often associated with locally advantageous traits that promote adaptation and speciation. The article also discusses methods for detecting inversions, such as karyotyping, genetic mapping, and high-fidelity sequencing. The authors call for further research to better understand the origin, evolutionary role, and molecular mechanisms of inversions in plants.

Thus, the analysis of scientific studies shows that existing approaches to diagnosing and predicting chromosomal inversions and translocations have their advantages, but also significant disadvantages. For example, the HGMD database, although providing useful information for clinical practice, faces problems with the automatic identification of new mutations. The use of biodosimetry to estimate radiation doses, in particular by analyzing chromosomal aberrations, has proven effective, but is dependent on factors such as age and smoking status. The PETI method demonstrates high accuracy in genetic editing, but requires additional research to confirm its widespread use. Chromosomal rearrangements confirm their role in the genetic differentiation of populations, but require a deeper study of the mechanisms of recombination suppression. The study of inversions in plants also emphasizes the need for further research into the evolutionary mechanisms of these changes. This confirms the importance of continuing research in this area. In particular, the need to develop a parallel algorithm for the diagnosis and prediction of chromosomal inversions and translocations is driven by the need for faster and more accurate detection of genetic abnormalities, which is critical for effective treatment and management of diseases.

The main contribution of this paper:  A new parallel method for detecting chromosomal translocations and inversions using multithreading and parallel computing (OpenMP) is proposed. This method ensures optimal load distribution between threads, reducing processing time and increasing scalability.  The Needleman-Wunsch alignment method has been parallelized to compare sequences, which has increased the accuracy of translocation detection. This approach allows for more accurate identification of chromosomal abnormalities, outperforming traditional sequential alignment methods.  When using eight threads, the proposed method achieves speedups of up to 3.8 for translocations and 3.4 for inversions. This confirms its suitability for processing large amounts of data and the possibility of further optimization on multi-core or GPUs.

2. Methodic and materials 2.1. General scheme of the mutations method for detecting chromosomal

The task is to develop a method for detecting mutations in DNA sequences. There are two sets of DNA for inversion and four for translocation: half of them are taken as a reference, i.e. without mutations, and the rest with mutations. This method should be able to detect two types of mutations: inversions and translocations.

Let and be DNA sequences of length and , respectively, =( 1, 2, ..., ) and =( 1, 2, ..., ), where i and i take the values {A, C, T, G}. Sequence is the same sequence , but at a certain interval from i to i + k the sequence is inverted, i.e. [i:i+k] ! [i:i+k] and [i:i+k] = reverse( [i:i+k]). To search for inversions, the algorithm comparing two DNA sequences is parallelized. This algorithm detects mismatches in nucleotide sequences and determines the gaps where nucleotides in DNA molecules do not match. After that, the found gaps are checked for inversions. The result of this check is the determination of the gaps where inversions are detected.

To search for translocations, we need to parallelize the Needleman-Wunsch algorithm, which consists in constructing an optimal alignment, using the optimal alignments of the initial fragments of the original sequences obtained in the previous steps. For two sequences and with elements

I (0<i<n ) and j (0<j<m), we construct the matrix F. The element of this matrix contains the weight (score) of the best alignment of elements 1, 2, ..., i and 1, 2, ..., j of the sequences and , respectively. We build the matrix F recursively. We assign the starting point zero weight F(0,0)=0. Then we fill in the matrix in ascending order of both indices, i.e. from the upper left corner to the lower right. If we have already defined F(i-1 ,j-1), F(i-1, j), F(i, j-1), then we can define F(i, j), as , = 푚

, , , , ( , – the value in the alignment matrix in the position , ;

) – the value of the penalty (or reward) for matching (match/mismatch) characters and

; where:   

– the value of the gap penalty, which can be negative or positive.

This formula recursively calculates the values in each cell of the alignment matrix , , considering the three possible ways to reach the current cell: diagonally (matching), vertically (insertion/deletion in the first sequence), and horizontally (insertion/deletion in the second sequence). Figure 1 shows a general flowchart of the chromosomal mutation detection algorithm

2.2 Parallelizing the detection of chromosomal inversions and translocations

This paper uses OpenMP technology to detect chromosomal inversions and translocations in the genome. OpenMP allows parallelization of the code, reducing computation time and increasing program efficiency. Testing on a different number of threads showed a significant speedup of the algorithm. By working with shared memory, OpenMP simplifies the exchange of data between threads, which helps to synchronize them.

The main work of the algorithm for finding chromosomal inversions can be divided into two steps: 1. Detection of chromosomal inversion gaps. 2. Analysis and verification of the results.

The detection of chromosomal inversion gaps is performed by comparing the reference with the target in order to detect mismatches. Accordingly, in a sequential algorithm, the search is performed from the beginning of the array to its end (see Figure 2).

3. Results

The data for testing were taken from the DNA sequence dataset [17]. This dataset is a dataset that contains DNA sequences and contains information about nucleotide sequences consisting of four possible bases: adenine (A), cytosine (C), guanine (G), and thymine (T).

First, we present the results of the parallel algorithm using OpenMP technology to find inversions and translocations (see Table 1 and Table 2, respectively). The program execution time will depend on the number of threads and the number of sequences in the file. 1045 1.8 1.2 0.9 0.8 4177 25 14.5 10 9 5801 57 31 21 15 Numerical experiments were performed on four different samples of nucleotide sequences. According to Table 1, it can be concluded that parallelization is not advisable for small values of n, Further, based on the results obtained, we calculate the speedup = = ( ) of the proposed parallel method, where ( ) – the execution time of the and the calculation speed improves with increasing data volume. Similarly, Table 2 shows that a small matrix dimension is not optimal for a large number of threads. Similarly to the algorithm for detecting inversions for translocations, the parallelization efficiency improves with increasing data size. sequential algorithm, ( ) – the execution time of the parallel algorithm, and threads. The results are shown in Table 3 and Table 4 1045 0.19 0.25 0.28 4177 0.22 0.31 0.35 5801 0.23 0.34 0.48

From the results, we can see that the proposed parallel method yields good speedup and efficiency with increasing number of threads and input data dimensionality, which indicates that the method is well scalable. It should be noted that all experiments were performed on a computer with eight logical processors. Therefore, they can be significantly improved with a more powerful computing system. The paper also investigates the accuracy achieved by the proposed approach for finding inversions and translocations. To do this, the program was run 1000 times with different lengths of DNA sets with pre-introduced mutations randomly, and the final result was calculated as an average. Table 5 presents the accuracy results of the proposed method on different data sizes, and Table 6 compares the accuracy of the Smith-Waterman algorithm [18] and the NeedlemanWunsch algorithm. Needleman-Wunsch algorithm 200 500 1000 96.7

4. Conclusions and future research

In this paper, we investigated the effectiveness of parallel algorithms for detecting chromosomal mutations, such as inversions and translocations, using OpenMP technology. The algorithms were developed to compare DNA sequences and detect deviations in nucleotide sequences, which allowed for high accuracy in identifying inversions. For translocations, the parallel Needleman-Wunsch algorithm was used to ensure optimal alignment of DNA fragments. Numerical experiments demonstrated a significant performance improvement on different datasets: the best performance was 3.4 for inversions and 3.8 for translocations, achieved using eight threads. The results indicate an improvement in performance with increasing amounts of input data and the possibility of further optimization of the results obtained through the development of multi-core computing systems. To further improve the efficiency of the developed algorithm, we can consider its implementation on a graphics processing unit using CUDA technology, which will potentially provide even higher data processing speed.

5. Acknowledgements

The authors would like to thank the Armed Forces of Ukraine for providing security to perform this work. This work has become possible only because of the resilience and courage of the Ukrainian Army.

6. Declaration on Generative AI

During the preparation of this paper, the authors utilized Grammarly to verify spelling and grammar accuracy.

7. References

[6] S. Saif, P. Das, S. Biswas, S. Khan, M. A. Haq, and V. Kovtun, “A secure data transmission framework for IoT enabled healthcare,” Heliyon, vol. 10, no. 16, p. e36269, Aug. 2024, doi: 10.1016/j.heliyon.2024.e36269. [7] V. V. Shkarupylo, I. V. Blinov, A. A. Chemeris, V. V. Dusheba, and J. A. J. Alsayaydeh, “On Applicability of Model Checking Technique in Power Systems and Electric Power Industry,” in Systems, Decision and Control in Energy III, vol. 399, A. Zaporozhets, Ed., in Studies in Systems, Decision and Control, vol. 399. , Cham: Springer International Publishing, 2022, pp. 3–21. doi: 10.1007/978-3-030-87675-3_1. [8] P. D. Stenson et al., “The Human Gene Mutation Database (HGMD®): optimizing its use in a clinical diagnostic or research setting,” Hum. Genet., vol. 139, no. 10, pp. 1197–1207, Oct. 2020, doi: 10.1007/s00439-020-02199-3. [9] S. Banoon, T. Salih, and A. Ghasemian, “Genetic Mutations and Major Human Disorders: A Review,” Egypt. J. Chem., vol. 0, no. 0, pp. 0–0, Oct. 2021, doi: 10.21608/ejchem.2021.98178.4575. [10] M. J. McKenna et al., “Chromosome Translocations, Inversions and Telomere Length for Retrospective Biodosimetry on Exposed U.S. Atomic Veterans,” Radiat. Res., vol. 191, no. 4, p. 311, Feb. 2019, doi: 10.1667/RR15240.1. [11] J. Kweon, H.-Y. Hwang, H. Ryu, A.-H. Jang, D. Kim, and Y. Kim, “Targeted genomic translocations and inversions generated using a paired prime editing strategy,” Mol. Ther., vol. 31, no. 1, pp. 249–259, Jan. 2023, doi: 10.1016/j.ymthe.2022.09.008. [12] X. Zeng et al., “Gene sequencing and result analysis of balanced translocation carriers by third-generation gene sequencing technology,” Sci. Rep., vol. 13, no. 1, p. 7004, Apr. 2023, doi: 10.1038/s41598-022-20356-8. [13] Y. Xia, X. Yuan, W. Luo, S. Yuan, and X. Zeng, “The Origin and Evolution of Chromosomal Reciprocal Translocation in Quasipaa boulengeri (Anura, Dicroglossidae),” Front. Genet., vol. 10, p. 1364, Jan. 2020, doi: 10.3389/fgene.2019.01364. [14] K. Huang and L. H. Rieseberg, “Frequency, Origins, and Evolutionary Role of Chromosomal Inversions in Plants,” Front. Plant Sci., vol. 11, p. 296, Mar. 2020, doi: 10.3389/fpls.2020.00296. [15] N. F. Binti Abdul Rahim et al., “Channel Congestion Control in VANET for Safety and NonSafety Communication: A Review,” in 2021 6th IEEE International Conference on Recent Advances and Innovations in Engineering (ICRAIE), Kedah, Malaysia: IEEE, Dec. 2021, pp. 1–6. doi: 10.1109/ICRAIE52900.2021.9704017. [16] I. Fedorchenko, A. Oliinyk, J. A. J. Alsayaydeh, A. Kharchenko, A. Stepanenko, and V. Shkarupylo, “MODIFIED GENETIC ALGORITHM TO DETERMINE THE LOCATION OF THE DISTRIBUTION POWER SUPPLY NETWORKS IN THE CITY,” Dec. 2020, doi: 10.5281/ZENODO.5163692. [17] Nagesh Singh Chauhan, “DNA sequence dataset.” 2021. [Online]. Available: https://www.kaggle.com/datasets/nageshsingh/dna-sequence-dataset [18] Z. Xia et al., “A Review of Parallel Implementations for the Smith–Waterman Algorithm,” Interdiscip. Sci. Comput. Life Sci., vol. 14, no. 1, pp. 1–14, Mar. 2022, doi: 10.1007/s12539-021-00473-0.

[1]

Dias et al., “Translocations and inversions: major chromosomal rearrangements during Vigna (Leguminosae) evolution,” Theor . Appl. Genet., vol. 137 , no. 1 , p. 29 , Jan . 2024 , doi: 10.1007/s00122-024-04546-8.

[2]

A. E. E.-D.

Rashed ,

H. M.

Amer ,

El-Seddek , and H. E.-D. Moustafa, “ Sequence Alignment Using Machine Learning-Based Needleman-Wunsch Algorithm ,” IEEE Access, vol. 9 , pp. 109522 - 109535 , 2021 , doi: 10.1109/ACCESS. 2021 . 3100408 .

[3]

Mochurad ,

Sydor , and

Ratinskiy , “ A fast parallelized DBSCAN algorithm based on OpenMp for detection of criminals on streaming services , ” Front. Big Data , vol. 6 , p. 1292923 , Oct . 2023 , doi: 10.3389/fdata. 2023 . 1292923 .

[4]

Mochurad , “ Implementation and analysis of a parallel kalman filter algorithm for lidar localization based on CUDA technology,” Front . Robot. AI , vol. 11 , p. 1341689 , Feb . 2024 , doi: 10.3389/frobt. 2024 . 1341689 .

[5]

Kovtun ,

Altameem ,

Al-Maitah , and W. Kempa, “ Simple statistical tests selection based parallel computating method ensures the guaranteed global extremum identification , ” J. King Saud Univ. - Sci ., vol. 36 , no. 5 , p. 103165 , May 2024 , doi: 10.1016/j.jksus. 2024 . 103165 .