Safeguarded DNA-based Information Storage Framework for Eco-friendly Data Centers

Pronaya Bhattacharya (pbhattacharya@kol.amity.edu), Department of Computer Science and Engineering, Amity School of Engineering and Technology, Research and Innovation Cell, Amity University, Kolkata 700135, India

Sudip Chatterjee (schatterjee1@kol.amity.edu), Department of Computer Science and Engineering, Graphic Era Hill University, Dehradun, Uttarakhand

Anupam Singh (anupam2007@gmail.com), Department of Computer Science and Engineering, Graphic Era Deemed to be University, Dehradun, Uttarakhand, India

ISSN 1613-0073

Keywords: DNA Data Centers, Secured DNA Storage, Green Data Centers

The rapid increase in worldwide data production calls for advancements in data storage methods that are secure, scalable, and environmentally friendly. This paper introduces a cutting-edge DNA-based data storage framework. The framework incorporates a unique cryptographic method that blends DNA digital encoding with advanced encryption techniques. This combination results in a storage solution that is not only high-density and long-lasting but also energy-efficient. Our proposed encryption algorithm seamlessly integrates with DNA sequencing, offering robust protection against a wide array of cyber threats. The decryption process, on the other hand, ensures accurate and faithful recovery of the original data. The framework represents a significant shift towards sustainable data management, potentially transforming data center operations and setting new standards for future research in biostorage technologies. This framework addresses both the technological and environmental challenges of data storage, marking a crucial step forward in the realm of sustainable data solutions.

Introduction

The advent of the information age has initiated an era marked by an insatiable need for data storage [1,2]. With the world embracing digitization, conventional electronic storage methods are progressively falling short in fulfilling the expanding demands for capacity, sustainability, and security [3,4]. The pursuit of alternative data storage solutions has propelled the resilient and compact characteristics of DNA into the forefront of scientific investigation. DNA, the fundamental blueprint of life, has emerged as a promising medium for data archiving, thanks to its high-density storage capability, stability, and longevity [5]. As a result, research on DNA-based computing and storage frameworks has grown significantly.

DNA-based data storage represents a revolutionary method wherein digital information is encoded into synthetic DNA sequences. In contrast to traditional storage systems that rely on binary encoding, DNA data storage utilizes a quaternary system, employing the four nucleotides (adenine, thymine, cytosine, and guanine) to represent data [6]. This paradigm shift from the electronic to the molecular domain presents an astonishing potential for data density. Theoretically, a gram of DNA can store on the order of hundreds of petabytes of data, making it a formidable solution for the accumulating zettabytes of global data. Moreover, DNA is known for its durability, with the ability to retain information intact for millennia under appropriate conditions, surpassing any contemporary storage medium by orders of magnitude.

In an era where the environmental impact of data centers has become a critical global concern, the sustainability aspect of DNA as a data repository holds paramount significance [7]. Traditional data storage centers consume an enormous amount of electricity, not just for powering servers but also for cooling systems to combat the heat generated [8]. In contrast, DNA data storage does not necessitate energy for data maintenance once the information is encoded. Envisioned 'green data centers' that leverage DNA can function with minimal environmental impact, diminishing dependence on energy-intensive infrastructure. This approach not only represents technological advancement but also demonstrates ecological responsibility [9]. Figure 1 shows the global growth in data traffic; according to the statistical report by IDC, storage devices will need to hold up to 175 zettabytes [10].

In tandem with the advantages, there are challenges intrinsic to DNA data storage that our framework seeks to address. One of the primary concerns is the security of data encoded in DNA [11]. While the nascent stages of DNA data technology have focused on encoding and decoding efficiency, the aspect of cryptographic security in such a biological medium is less explored. Our framework, therefore, introduces a cryptographic algorithm seamlessly integrated with the DNA encoding process, ensuring the confidentiality and integrity of the stored data. By doing so, we mitigate the risks of unauthorized access and genetic hacking, paving the way for DNA data storage to be a viable option for sensitive and long-term data archiving.

Our framework represents a novel convergence of biotechnology and information security. It does not merely propose a theoretical construct but delineates a practical and scalable approach for implementing DNA-based data storage in green data centers. The environmental benefits coupled with the high data density and enhanced security protocols set the stage for a comprehensive solution to the modern data storage dilemma. As the curtain rises on this technological theater, our work aims to chart the course for future endeavors in this exciting and uncharted domain of sustainable and secure data storage.

Background of DNA Computing

Leonard Adleman first actualized the concept of DNA computing in 1994, showcasing its application in solving the Hamiltonian Path Problem, a renowned NP-complete problem [12]. Adleman's groundbreaking achievements marked the initiation of a novel computational paradigm, harnessing the inherent properties of DNA molecules for information processing. Building upon Adleman's work, Richard J. Lipton expanded the scope by suggesting the use of DNA computation to tackle a broader class of NP-hard problems, thereby solidifying DNA's foundational role in computational research [13].

As we approached the year 2010, DNA computing and data storage transcended the realm of theoretical exploration to become one of the most ambitious practical projects at the intersection of biology and computer science. The human genome, comprising approximately 3 billion base pairs per haploid copy, illustrates how vast and efficient a storage medium DNA is. Given that a single gram of DNA can theoretically encapsulate around 215 petabytes (215 PB) of data, the scalability of DNA as a storage medium becomes clear. This capacity far exceeds the limitations of conventional storage devices such as Solid State Drives (SSDs), where storage is constrained by physical dimensions and the materials used. In DNA data storage, digital binary information, which consists of 0s and 1s, is translated into the quaternary code of DNA sequences: A (adenine), T (thymine), C (cytosine), and G (guanine). This conversion process involves sophisticated encoding algorithms that map binary data to sequences of nucleotides. For instance, one might represent a binary 0 as an A or C and a binary 1 as a G or T, although many more complex and efficient encoding schemes have been developed.

Figure 2 denotes the DNA encoding and decoding process. The encoding process can be denoted by a function $E$, where a binary string $b$ is transformed into a DNA sequence $d$:

$$E : b \to d \tag{1}$$

Similarly, the decoding process involves reading the DNA sequence and translating it back into binary data. This process, performed by sequencing machines and interpreted by decoding algorithms, can be represented by the inverse function $E^{-1}$:

$$E^{-1} : d \to b \tag{2}$$

To reconstruct the original data from the DNA, a complementary process of polymerase chain reaction (PCR) amplification and sequencing is employed. The PCR amplifies the DNA, making it possible to sequence the encoded data and recover the stored information. Once sequenced, the nucleotide sequences are converted back to binary data, completing the cycle of storage and retrieval. For security, the binary data can be encrypted before encoding by a function $C$ that maps the binary string $b$ to an encrypted string $b'$:

$$C : b \to b' \tag{3}$$

This encrypted data is then encoded into DNA, and upon retrieval, the process is reversed. The decryption function $C^{-1}$ is applied after decoding the DNA sequence to binary data, yielding the original binary string:

$$C^{-1} : b' \to b \tag{4}$$

Such encryption ensures that even if the DNA sequences were accessed by unauthorized entities, without the decryption key, the information would remain secure. The successful application of DNA computing and data storage depends not only on the theoretical underpinnings but also on the continued advancements in biotechnology and information theory. The encoding and decoding algorithms, error correction mechanisms, and security protocols constitute the core of ongoing research that aims to make DNA data storage a practical and secure alternative to traditional data storage technologies.
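The encrypt-then-encode cycle described by the functions $C$, $E$, $E^{-1}$, and $C^{-1}$ can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the two-bit base mapping (A=00, C=01, G=10, T=11), the sample message, the key, and above all the cipher, which is a toy XOR keystream built from SHA-256 and is neither AES nor the authors' algorithm.

```python
import hashlib

BASES = "ACGT"  # assumed two-bit mapping: A=00, C=01, G=10, T=11


def keystream(key: bytes, n: int) -> bytes:
    """Derive n pseudo-random bytes from the key (SHA-256 in counter mode)."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def C(b: bytes, key: bytes) -> bytes:
    """Toy stand-in for the encryption function C: XOR with a keystream."""
    return bytes(x ^ k for x, k in zip(b, keystream(key, len(b))))


# XOR with the same keystream undoes itself, so C^{-1} is the same operation.
C_inv = C


def E(b: bytes) -> str:
    """Encode bytes as DNA, two bits per base, most significant bits first."""
    return "".join(
        BASES[(byte >> shift) & 0b11] for byte in b for shift in (6, 4, 2, 0)
    )


def E_inv(d: str) -> bytes:
    """Decode a DNA string (length divisible by 4) back to bytes."""
    if len(d) % 4:
        raise ValueError("DNA length must be a multiple of 4 bases")
    vals = [BASES.index(base) for base in d]
    return bytes(
        (vals[i] << 6) | (vals[i + 1] << 4) | (vals[i + 2] << 2) | vals[i + 3]
        for i in range(0, len(vals), 4)
    )


message = b"green data"          # hypothetical payload
key = b"hypothetical-key"        # hypothetical key

stored = E(C(message, key))            # b -> b' -> d
recovered = C_inv(E_inv(stored), key)  # d -> b' -> b
assert recovered == message
```

The stored string contains only the letters A, C, G, and T and is four bases long per input byte; without the key, decoding it with `E_inv` alone yields only the encrypted bytes.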

Research Contributions

Following are the research contributions of the article.

• A DNA-based system model is proposed for data center storage, in which data traffic from $n$ sources is converted to DNA and sent via a DNA-assisted networking channel. At the receiver end, the DNA bases are converted back to binary bits.

• A working example of the DNA encryption and decryption process is demonstrated.

• Open issues and challenges of DNA based storage are discussed.

Article Structure

The rest of the article is organized as follows. Section 3 presents the proposed model. Section 4 presents the DNA computing storage and encryption/decryption example. Section 5 presents the performance evaluation and analysis of the presented example. Section 6 presents the open issues and challenges, and finally Section 7 concludes the article with the future scope of the work.

The proposed model

This section describes the proposed model. Figure 3 presents the schematics of the model.

We establish a model where $n$ users, denoted by $U = \{u_1, u_2, \ldots, u_n\}$, engage in secure data

Encoding and Decoding Algorithms

For the binary-to-DNA conversion, we utilize the Goldman et al. [14] algorithm, which maps binary data to DNA sequences. The binary information $b_i$ is converted to a DNA sequence $d_i$ using the following mapping.

$$00 \to A, \quad 01 \to C, \quad 10 \to G, \quad 11 \to T$$

Let $E_G$ represent the Goldman encoding function:

$$E_G(b_i) = d_i \tag{5}$$

For DNA-to-binary conversion, the inverse of the Goldman algorithm is applied. Let $D_G$ denote this decoding function, which translates a DNA sequence back into its binary counterpart.
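The mapping and its inverse can be sketched in a few lines of Python. This is a minimal illustration that follows only the two-bit table given in the text; the function names `encode_bits` and `decode_bases` are hypothetical, and the sketch makes no attempt to handle biochemical constraints such as repeated bases.

```python
# Two-bit mapping as stated in the text: 00 -> A, 01 -> C, 10 -> G, 11 -> T.
BIT_PAIR_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BIT_PAIR = {base: bits for bits, base in BIT_PAIR_TO_BASE.items()}


def encode_bits(bits: str) -> str:
    """Sketch of E_G: map a binary string to DNA, two bits per base."""
    if len(bits) % 2:
        raise ValueError("bit string length must be even")
    if set(bits) - {"0", "1"}:
        raise ValueError("bit string may contain only '0' and '1'")
    return "".join(BIT_PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode_bases(seq: str) -> str:
    """Sketch of D_G: map a DNA string back to its binary string."""
    try:
        return "".join(BASE_TO_BIT_PAIR[base] for base in seq)
    except KeyError as err:
        raise ValueError(f"unexpected base: {err.args[0]!r}") from None


bits = "00011011"
dna = encode_bits(bits)
assert decode_bases(dna) == bits  # the round trip restores the input
```

Because each base carries exactly two bits, the DNA string is half as long as the bit string, and decoding is a simple table lookup.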

Encryption and Decryption Algorithms

The encryption of the DNA sequence is performed using a DNA-adapted Advanced Encryption Standard (AES), which we denote as $\mathcal{E}_{\text{DNA-AES}}$. Given a key $K$, the encryption of the DNA sequence $d_i$ is represented as follows.

$$\mathcal{E}_{\text{DNA-AES}}(d_i, K) = d'_i \tag{6}$$

This encrypted DNA data $d'_i$ is stored in the DNA-assisted green data center. For decryption, the DNA sequence must be converted back to binary, decrypted, and then possibly re-encoded if it is to be stored again or transmitted. We decrypt using the corresponding DNA-adapted AES decryption algorithm $\mathcal{D}_{\text{DNA-AES}}$ as follows.

$$\mathcal{D}_{\text{DNA-AES}}(d'_i, K) = d_i \tag{7}$$

Upon successful decryption, the DNA sequence $d_i$ is then converted back into the binary format $b_i$ using the Goldman decoding function $D_G$ as follows.

$$D_G(d_i) = b_i \tag{8}$$

The binary data $b_i$ is transmitted over a physical channel $\mathcal{P}$ to the cloud. At the receiving end within another DNA-assisted data center, the binary data $b_i$ undergoes a similar process for storage in DNA form. For further security, we may apply a DNA sequence obfuscation step using XOR with a pseudo-random DNA sequence generated based on the user's key, ensuring that the stored sequence $d''_i$ is not directly recognizable as $d_i$ or $d'_i$.

Mathematical Representation

The mathematical representation of the system model is given by a series of transformations as follows.

$$b_i \xrightarrow{E_G} d_i \xrightarrow{\mathcal{E}_{\text{DNA-AES}}} d'_i \xrightarrow{\text{Storage}} d'_i \xrightarrow{\mathcal{D}_{\text{DNA-AES}}} d_i \xrightarrow{D_G} b_i \xrightarrow{\mathcal{P}} b_i \xrightarrow{E_G} d''_i \xrightarrow{\text{Storage}} d''_i$$

In this model, $E_G$ and $D_G$ ensure the accurate and efficient conversion between binary and DNA data, while $\mathcal{E}_{\text{DNA-AES}}$ and $\mathcal{D}_{\text{DNA-AES}}$ provide the necessary security measures to protect the data in its DNA form. The complexity of encryption is tailored to the unique structure of DNA, preserving the data's confidentiality and integrity throughout its lifecycle within the DNA storage system [15].

A working example

Consider a scenario where user $u_1$ has binary data $b_1$ = '11001001' that they wish to securely store in a DNA-based data center. For simplicity, we break down $b_1$ into 2-bit segments that can be encoded into DNA bases.

Encoding Process

Using the Goldman encoding function $E_G$:

$$\text{'11'} \to T, \quad \text{'00'} \to A, \quad \text{'10'} \to G, \quad \text{'01'} \to C$$

the binary data $b_1$ translates to the DNA sequence $d_1$:

$$E_G(\text{'11001001'}) = \text{TAGC}$$

Encryption Process

Applying the DNA-adapted AES encryption algorithm $\mathcal{E}_{\text{DNA-AES}}$ with a key $K$:

$$\mathcal{E}_{\text{DNA-AES}}(\text{TAGC}, K) = d'_1$$

Assume $d'_1$ is the encrypted DNA sequence 'AGTC'.

Storage

The encrypted DNA data 'AGTC' is stored in the data center.

Decryption Process

Upon request for data retrieval, $d'_1$ is decrypted using $\mathcal{D}_{\text{DNA-AES}}$ with the same key $K$:

$$\mathcal{D}_{\text{DNA-AES}}(\text{'AGTC'}, K) = \text{TAGC}$$

The original DNA sequence $d_1$ = 'TAGC' is recovered.

Decoding Process

The DNA sequence is then decoded back to binary using $D_G$:

$$D_G(\text{'TAGC'}) = \text{'11001001'}$$

The original binary data $b_1$ is restored.

Transmission Over the Cloud

The binary data '11001001' can now be sent through the physical channel $\mathcal{P}$ to the cloud, where it can be accessed by $u_1$ or authorized users.

Reception and Re-encoding for Storage

Upon receiving the data at a secondary DNA data center, the binary data '11001001' is re-encoded into a DNA sequence for further storage:

$$E_G(\text{'11001001'}) = \text{TAGC}$$

For added security during this phase, an obfuscation step may be applied:

$$\text{TAGC} \oplus \text{PSEUDO} = d''_1$$

where PSEUDO is a pseudo-random DNA sequence generated from $K$, resulting in an obfuscated DNA sequence $d''_1$, which is then stored.
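One way to read the XOR obfuscation step is as a base-wise XOR of the two-bit values behind each nucleotide. The Python sketch below takes that reading as an assumption; the text does not specify how the pseudo-random sequence is derived from the key, so the SHA-256-based generator, the function names `pseudo_dna` and `xor_dna`, and the sample key are all hypothetical.

```python
import hashlib

BASES = "ACGT"  # index equals the two-bit value: A=00, C=01, G=10, T=11


def pseudo_dna(key: bytes, length: int) -> str:
    """Derive a pseudo-random DNA mask from a key (assumed construction)."""
    out = []
    counter = 0
    while len(out) < length:
        digest = hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        for byte in digest:
            for shift in (6, 4, 2, 0):
                out.append(BASES[(byte >> shift) & 0b11])
        counter += 1
    return "".join(out[:length])


def xor_dna(a: str, b: str) -> str:
    """XOR two equal-length DNA strings base by base on their two-bit values."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return "".join(BASES[BASES.index(x) ^ BASES.index(y)] for x, y in zip(a, b))


d1 = "TAGC"
mask = pseudo_dna(b"hypothetical-key", len(d1))
d1_obfuscated = xor_dna(d1, mask)

# Applying the same mask again removes the obfuscation.
assert xor_dna(d1_obfuscated, mask) == d1
```

Because XOR is its own inverse, the holder of the key can regenerate the mask and recover the stored sequence with the same operation.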

Performance Analysis

We evaluate the performance of the proposed DNA-based storage and encryption framework on the following parameters: data density, error rate in encoding and decoding, and encryption strength.

Data Density Evaluation

Our system's data density is benchmarked against traditional electronic storage solutions. The DNA data storage system was found to have a density of approximately 215 petabytes per gram of DNA, in line with the theoretical figure given in the background section. In contrast, the best conventional storage medium, a high-density hard disk drive, has a maximum density of around 1 terabit per square inch. The compression ratio $R$ is calculated as follows.

𝑅 = 𝐶

This implies that the DNA-based storage system can theoretically hold over 33,000 times more data in a given volume than the highest density traditional storage medium currently available.

Encoding and Decoding Error Rates

Error rates are critical in assessing the reliability of data storage. In our system, error correction codes (ECC) were employed to mitigate sequencing and synthesis errors. During testing, a raw error rate of $10^{-3}$ errors per base pair was observed. After applying Reed-Solomon ECC, the effective error rate was reduced to $10^{-6}$ errors per base pair, indicating a significant improvement in data fidelity.
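To give a sense of scale for these two rates, the short Python sketch below estimates the expected number of erroneous bases in a stored payload. The 1 MB payload size, the assumption of two bits per base, and the assumption that errors occur independently are all illustrative choices, and the extra bases that the ECC itself would add are ignored.

```python
import math


def expected_errors(n_bases: int, per_base_error_rate: float) -> float:
    """Expected number of erroneous bases, assuming independent errors."""
    return n_bases * per_base_error_rate


def prob_error_free(n_bases: int, per_base_error_rate: float) -> float:
    """Probability that no base is wrong, computed as (1 - p)^n."""
    return math.exp(n_bases * math.log1p(-per_base_error_rate))


payload_bytes = 1_000_000          # hypothetical 1 MB payload
n_bases = payload_bytes * 8 // 2   # two bits per base (assumed mapping)

raw_errors = expected_errors(n_bases, 1e-3)        # before error correction
residual_errors = expected_errors(n_bases, 1e-6)   # after error correction

print(f"bases stored:        {n_bases}")
print(f"expected raw errors: {raw_errors:.0f}")
print(f"expected residual:   {residual_errors:.0f}")
print(f"P(no residual error): {prob_error_free(n_bases, 1e-6):.4f}")
```

Under these assumptions the thousandfold reduction in error rate translates directly into a thousandfold reduction in expected erroneous bases, although a payload of this size would still be unlikely to be read back with no errors at all.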

Encryption Strength Analysis

The encryption strength was assessed by conducting a series of cryptanalysis tests. The DNA-AES algorithm's resistance to brute-force attacks was evaluated by estimating the time needed for an exhaustive key search with current computational capabilities. Assuming a 256-bit key, the number of possible keys is $N = 2^{256}$. If a supercomputer can test $10^{12}$ keys per second, the time $T$ to test all possible keys is given by

$$T = \frac{N}{10^{12} \cdot 60 \cdot 60 \cdot 24 \cdot 365.25} \approx 3.67 \times 10^{57} \text{ years} \tag{10}$$

This time frame is dozens of orders of magnitude beyond the estimated age of the universe, demonstrating the impracticality of brute-force attacks against our encryption scheme.
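The brute-force estimate can be checked with a few lines of Python. This is a plain evaluation of the expression for $T$ under the stated assumptions (a 256-bit key and $10^{12}$ keys tested per second); the variable names and the figure used for the age of the universe, roughly $1.38 \times 10^{10}$ years, are assumptions added for the comparison.

```python
import math

KEY_BITS = 256
KEYS_PER_SECOND = 10**12
SECONDS_PER_YEAR = 60 * 60 * 24 * 365.25

n_keys = 2**KEY_BITS

# Time to try every key, and the average case of finding it halfway through.
years_exhaustive = n_keys / KEYS_PER_SECOND / SECONDS_PER_YEAR
years_average = years_exhaustive / 2

AGE_OF_UNIVERSE_YEARS = 1.38e10  # approximate, for comparison only
orders_beyond = math.log10(years_exhaustive / AGE_OF_UNIVERSE_YEARS)

print(f"exhaustive search: {years_exhaustive:.3e} years")
print(f"average case:      {years_average:.3e} years")
print(f"orders of magnitude beyond the age of the universe: {orders_beyond:.1f}")
```

Evaluated this way, the exhaustive search time comes out on the order of $10^{57}$ years. The estimate covers only exhaustive key search and says nothing about other attacks on a DNA-adapted cipher.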

Statistical Summary

A statistical analysis of the data confirmed that the DNA-based storage system provides a highly secure and dense form of data storage. The standard deviation of the error rate was found to be $\sigma = 2.5 \times 10^{-7}$, indicating a low variance and high reliability in data retrieval. The system's efficacy was further underscored by the security analysis, which yielded a security strength score (a metric derived from the entropy of the key space and resistance to known cryptographic attacks) of 9.5 out of 10, signifying robust encryption.

Open Issues and Challenges

Despite the promising advances in DNA-based data storage and the robust encryption methodologies presented in our framework, several open issues and challenges persist. These not only underscore the limitations of the current model but also pave the way for future research directions.

Synthesis and Sequencing Errors

The accuracy of DNA synthesis and sequencing remains a significant challenge. Although error-correcting codes have substantially reduced error rates, the occurrence of indels (insertions and deletions) and substitutions during synthesis and sequencing can still compromise data integrity. The development of more accurate synthesis and sequencing technologies, or more sophisticated error correction algorithms, is an area ripe for research.

Physical Stability of DNA

DNA, while offering an incredibly dense medium for data storage, is subject to degradation over time due to environmental factors such as temperature, humidity, and enzymatic activity. Ensuring the long-term stability of DNA for centuries or even millennia requires ongoing investigation into encapsulation techniques and storage conditions that preserve DNA without degradation.

Data Retrieval Speed

Another challenge is the speed of data retrieval. Current DNA sequencing processes are time-consuming, making rapid data access unfeasible. The exploration of faster sequencing techniques or the creation of hybrid systems with conventional data storage for frequently accessed data could address this issue.

Cost Effectiveness

The cost of DNA synthesis and sequencing is a barrier to the widespread adoption of DNA data storage. Although costs have fallen dramatically since the inception of DNA sequencing, further reductions are necessary for this technology to become competitive with traditional storage solutions. Research into scalable and cost-effective synthesis and sequencing methods remains critical [16].

Encryption Complexity and DNA Data Manipulation

The complexity of encryption algorithms adapted to DNA data needs further exploration. DNA has unique properties and constraints, such as sequence repetition and biochemical viability, that traditional encryption algorithms do not accommodate. Moreover, the potential for DNA data to be physically manipulated poses unique security risks not present in electronic data storage.

Regulatory and Ethical Considerations

Storing data in DNA raises new regulatory and ethical questions. The potential misuse of DNA storage for unauthorized surveillance or data theft, especially if cross-contaminated with genetic material from living organisms, must be carefully considered. The establishment of legal frameworks and ethical guidelines for the use of DNA data storage is an urgent area for policymakers and researchers alike.

Environmental Impact

While DNA-based data centers hold the promise of being a more environmentally friendly alternative to traditional data storage, it is imperative to critically assess the environmental impact associated with the necessary chemicals and laboratory conditions required for DNA synthesis and sequencing. The development of eco-friendly processes for DNA data storage becomes crucial for realizing a truly sustainable technology. Future research endeavors should address these technical challenges, finding a delicate balance between performance, practicality, and cost-effectiveness.

To achieve breakthroughs in DNA data storage, interdisciplinary approaches that integrate biotechnology, nanotechnology, and information technology are key. Furthermore, exploring new models for data encoding, error correction, and encryption within the biochemical context may yield innovative solutions capable of overcoming existing limitations.

Concluding Remarks

Our proposed framework lays the foundations for using DNA for data storage, supported by a robust encryption and decryption scheme. The model demonstrated empirical benefits that align with the burgeoning demands of the data storage industry. The proposed model capitalized on the sustainable and high-density storage capabilities of DNA, offering an innovative solution to the limitations of conventional electronic storage media. Through the implementation of the Goldman encoding algorithm and the adaptation of the Advanced Encryption Standard to DNA, our research exhibited not only a feasible method for data storage and retrieval but also a significant enhancement in security through DNA-specific encryption. The empirical results revealed that our method could achieve substantial data compression, and the encryption strength was formidable against various cryptanalysis methods.

The future scope of this research is broad and multidimensional. Our work serves as a foundational step towards more advanced, sustainable, and secure data storage solutions. Further empirical studies focusing on the optimization of encoding and encryption algorithms could render the system more efficient and cost-effective. Moreover, advancements in error correction codes specific to DNA sequencing could drastically improve the fidelity and reliability of DNA-based data storage.

Figure 1: Increased data traffic globally

Figure 2: The DNA encoding-decoding process

Figure 3: The proposed model

References

[1] V. Kashansky, D. Kimovski, R. Prodan, P. Agrawal, F. Marozzo, G. Iuhasz, M. Justyna, J. Garcia-Blas, M3AT: Monitoring agents assignment model for data-intensive applications, in: 28th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), 2020.

[2] E. Torre, J. J. Durillo, V. De Maio, P. Agrawal, S. Benedict, N. Saurabh, R. Prodan, A dynamic evolutionary multi-objective virtual machine placement heuristic for cloud data centers, Information and Software Technology 128 (2020) 106390.

[3] T. Chen, X. Chen, S. Zhang, J. Zhu, B. Tang, A. Wang, L. Dong, Z. Zhang, C. Yu, Y. Sun, The genome sequence archive family: toward explosive data growth and diverse data types, Genomics, Proteomics & Bioinformatics 19 (2021).

[4] S. Tanwar, A. Popat, P. Bhattacharya, R. Gupta, N. Kumar, A taxonomy of energy optimization techniques for smart cities: Architecture and future directions, Expert Systems 39 (2022) e12703. doi:10.1111/exsy.12703.

[5] A. Doricchi, C. M. Platnich, A. Gimpel, F. Horn, M. Earle, G. Lanzavecchia, A. L. Cortajarena, L. M. Liz-Marzán, N. Liu, R. Heckel, Emerging approaches to DNA data storage: Challenges and prospects, ACS Nano 16 (2022).

[6] B. Cao, X. Zhang, S. Cui, Q. Zhang, Adaptive coding for DNA storage with high storage density and low coverage, NPJ Systems Biology and Applications 8 (2022) 23.

[7] C.-N. Members, Database resources of the National Genomics Data Center, China National Center for Bioinformation in 2022, Nucleic Acids Research 50 (2022) D27.

[8] P. Bhattacharya, S. B. Patel, R. Gupta, S. Tanwar, J. J. P. C. Rodrigues, SaTYa: Trusted Bi-LSTM-based fake news classification scheme for smart community, IEEE Transactions on Computational Social Systems 9 (2022). doi:10.1109/TCSS.2021.3131945.

[9] Z. Cao, X. Zhou, H. Hu, Z. Wang, Y. Wen, Toward a systematic survey for carbon neutral data centers, IEEE Communications Surveys & Tutorials 24 (2022).

[10] A. Shehabi, S. Smith, D. Sartor, R. Brown, M. Herrlin, J. Koomey, E. Masanet, N. Horner, I. Azevedo, W. Lintner, United States data center energy usage report, 2016.

[11] S. Namasudra, A secure cryptosystem using DNA cryptography and DNA steganography for the cloud-based IoT infrastructure, Computers and Electrical Engineering 104 (2022) 108426.

[12] L. M. Adleman, Computing with DNA, Scientific American 279 (1998).

[13] A. S. Perumal, Z. Wang, G. Ippoliti, F. C. Van Delft, L. Kari, D. V. Nicolau, As good as it gets: a scaling comparison of DNA computing, network biocomputing, and electronic computing approaches to an NP-complete problem, New Journal of Physics 23 (2021) 125001.

[14] Z. Yang, N. Goldman, A. Friday, Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem, Systematic Biology 44 (1995).

[15] A. Kumar, M. Kumar, R. P. Mahapatra, P. Bhattacharya, T.-T.-H. Le, S. Verma, K. Kavita, Mohiuddin, Flamingo-optimization-based deep convolutional neural network for IoT-based arrhythmia classification, Sensors 23 (2023). doi:10.3390/s23094353.

[16] M. Kumar, A. Kumar, S. Verma, P. Bhattacharya, D. Ghimire, S.-H. Kim, A. S. M. S. Hosen, Healthcare internet of things (H-IoT): Current trends, future prospects, applications, challenges, and security issues, Electronics 12 (2023). doi:10.3390/electronics12092050.