=Paper=
{{Paper
|id=Vol-1353/paper_10
|storemode=property
|title=Hiding Color Images in DNA Sequences
|pdfUrl=https://ceur-ws.org/Vol-1353/paper_10.pdf
|volume=Vol-1353
|dblpUrl=https://dblp.org/rec/conf/maics/BeckY15
}}
==Hiding Color Images in DNA Sequences==
Marc B. Beck*, Roman V. Yampolskiy

Cybersecurity Lab, Department of Computer Engineering and Computer Science, Speed School of Engineering, University of Louisville, Louisville, KY 40292

{mbbeck05,roman.yampolskiy}@Louisville.edu

Copyright held by the author(s).

===Abstract===
With recent advances in genetic engineering it has become possible to embed artificial DNA strands into the living cells of organisms. Because DNA has a great capacity for data storage, many methods have been developed to insert artificial information into a DNA sequence. However, most of these methods focus on the encoding of text data, and little research has been done on the encoding of other media. Few methods have been researched to encode images, and most of those handle only black-and-white images. We propose an algorithm to insert color images, in the form of bitmap files, into a DNA sequence in the form of a FASTA file, and to extract them again. Results from our experiments show that the proposed method is significantly more efficient than previous approaches.

===Introduction===
Deoxyribonucleic acid (DNA), the molecule that carries the hereditary information for every living organism, is a double helix with two anti-parallel strands. These strands contain four different nucleotides, which are distinguished by the bases adenine (A), cytosine (C), guanine (G), and thymine (T). Using combinations of those four nucleotides, DNA has the potential to store vast amounts of data within genomes, the length of which can range up to several billion bases (Anam, Sakib, Hossain, & Dahal, 2010). Certain regions within a genomic sequence are genes, which code for proteins consisting of amino acids. Marshall Nirenberg (Martin, Matthaei, Jones, & Nirenberg, 1962) first discovered the genetic code, which dictates how the combinations of the four bases of DNA are translated into twenty amino acids.

A sequence of three nucleotides that determines which amino acid will be incorporated next during protein synthesis is called a codon. The four nucleotides can be combined into 4³ = 64 unique codons. Each codon, except the three STOP codons TAA, TAG, and TGA (Brenner, Stretton, & Kaplan, 1965; Martin et al., 1962), encodes one of the twenty amino acids. This means that multiple codons encode the same amino acid, which makes the code degenerate.

Recent advances in genetic engineering have made it possible to insert artificial DNA sequences into the living cells of organisms (Gibson et al., 2010). This allows the insertion of information into the DNA strands of organisms such as bacteria for applications like data storage, watermarking, or communication of secret messages.

===DNA as Storage Medium===
Researchers are investigating DNA as an ultra-compact, long-term data storage medium (Church, Gao, & Kosuri, 2012; Goldman et al., 2013; Wong, Wong, & Foote, 2003) as well as a stegomedium for hiding information (Smith, Fiddes, Hawkins, & Cox, 2003). Messages are expressed as a series of As, Cs, Gs, and Ts in DNA code as opposed to ones and zeroes in binary code. In order to encode messages in DNA, researchers have developed various algorithms that can either insert a message into an existing DNA sequence (Jiao, 2009) or disguise it as a new one. These artificial DNA strands can be inserted into the genomes of living organisms, which has been proven possible by Venter (Gibson et al., 2010) and others (Arita, 2004; Brenner et al., 2000; Heider & Barnekow, 2008; Jiao, 2009; Jiao & Goutte, 2008; Yachie, Sekiyama, Sugahara, Ohashi, & Tomita, 2007).

The first cell with a synthetic genome was created in 2010 by Craig Venter, who led the private effort to sequence the human genome. Venter and his team at the J. Craig Venter Institute (JCVI) modified a computer file of the DNA sequence of the bacterium Mycoplasma mycoides. They then produced physical DNA from this sequence, which they inserted into a cell. This cell reproduced under control of the artificial DNA to create a new bacterium (Gibson et al., 2010).

High density, redundancy, and a long life expectancy are some of the many advantages of using DNA as a data storage medium.

===DNA Steganography===
Steganography is the science of transmitting a message hidden inside a cover medium in such a way that no one other than the sender and the intended recipient suspects its existence. The goal of steganography is to avoid any suspicion that the message exists, while cryptography merely aims at making a message unreadable (Wong et al., 2003). Its properties as a data storage medium also make DNA a good medium for steganography.

Noncoding genomic regions may seem to be an obvious choice of locations for inserting messages. However, the biological purpose of these regions is not yet fully understood [11], and tampering with them might cause the death of the organism.

Arita et al. (Arita, 2004) suggested encoding the message in the protein coding regions of genes instead. With 20 amino acids and one stop symbol using a total of 64 possible codons (Arita & Ohashi, 2004), there is redundancy: often two or more codons code for the same amino acid. This redundancy can be used to embed messages, since many of these redundant, or synonymous, codons differ only in their third position, also known as the wobble base (Brenner et al., 1965). This means that changing the wobble base to hide messages will not affect the coded amino acid.

===Coding Schemes for Hiding Text in DNA===
A code is a cryptographic rule that determines which symbol from a target alphabet uniquely represents which symbol from a source alphabet. In DNA steganography, the source alphabet is made of alphanumeric characters in the case of text information, or color values of pixels in the case of images. The target alphabet consists of various combinations of the initials of the four nucleotides.

Which symbol in the target alphabet is chosen to represent which symbol in the source alphabet is determined by a set of rules called a coding scheme. We have already compared several existing coding schemes in an earlier publication (Beck, Rouchka, & Yampolskiy, 2013). Simple substitution ciphers are the most common ones ([7]; Clelland, Risca, & Bancroft, 1999). Other, more sophisticated coding schemes exist, which are geared either toward error detection and correction (Arita & Ohashi, 2004; Heider & Barnekow, 2007) or toward optimization of the available capacity to hide messages (Huffman, 1952; Ailenberg & Rotstein, 2009).

===Insertion of Media other than Text===
Most research in DNA steganography focuses on hiding text, and very little research has been done so far on hiding other media in a DNA sequence. Goldman et al. (2013) describe encoding five files of various types in a DNA sequence. These files include a JPEG 2000 image and a speech in MP3 format. Their coding scheme uses several intermediate steps. First, the image file and the sound file are translated into binary. Then, the text file and the binary data from the other files are translated into a base-3 code and finally into sequences of DNA bases.

Davis (1996) describes a method of encoding the black-and-white image of a relatively simple shape (a 5 by 7 bitmap) into a 35-bit binary sequence, which was then compressed. His approach compares the molecular weights of the bases to obtain an incremental reference. Starting with the smallest base, cytosine, Davis assigns numbers to the bases in ascending order. This results in C = 1, T = 2, A = 3, and G = 4. The method compresses the binary digits of the bit-mapped image into fewer DNA base symbols by using each base to indicate how many times a binary value (0 or 1) is to be repeated before changing to the other value. This technique, known as run-length encoding, is widely used in data compression. The scheme is shown in Table 1.

{| class="wikitable"
|+ Table 1. Coding scheme used by Davis (Davis, 1996)
! Base !! Bit sequence
|-
| C || 1 or 0
|-
| T || 11 or 00
|-
| A || 111 or 000
|-
| G || 1111 or 0000
|}

Using this coding method, the thirty-five-bit black-and-white image is translated to only eighteen DNA bases:

CCCCCCAACGCGCGCGCT

These can be decoded to yield one of the two following binary sequences:

10101011100010000100001000010000100

or

01010100011101111011110111101111011

Which of the two is obtained depends on whether a 1 or a 0 is chosen to start the decoding sequence. Transforming either of the two sequences into the correct five-by-seven matrix will produce the image. Since the example used by Davis is bilaterally symmetrical, more than one of several possible five-by-seven matrices will in this case produce the correct bitmap (Davis, 1996).

Ailenberg and Rotstein (2009) have developed a coding scheme to encode an image that is composed of shapes and their coordinates. This way of encoding an image is not very efficient. A more feasible approach has been described by Yokoo and Oshima (1978). This approach arranges the 3-base codons of a DNA sequence in a two-dimensional array and then translates one base of each codon into either black or white, with G and C being black and A and T being white, or vice versa. This is done for each base of all the codons, which results in three separate images.

Hennings and Kettelberger (2004) have developed a method to generate music by decoding and transcribing genetic information within a DNA sequence into a music signal having melody and harmony.

===Methodology===
We have developed a coding scheme very similar to the one described by Yokoo and Oshima (1978), with the difference that we use all three bases of each codon for encoding color information instead of creating three separate images. Our approach determines the width and height of the array used for creating the image using the two closest factors of the number of codons. This results in a picture that is as close to a square in shape as possible.

The DNA sequence is arranged in a two-dimensional array in the same way as described by Yokoo and Oshima (1978), but in our case the first base of each codon is used to encode the red portion, the second base the green portion, and the third base the blue portion of each pixel. DNA bases are translated into RGB values, and RGB values into DNA bases, using the following coding tables:

{| class="wikitable"
|+ Table 2. Translation of DNA bases to RGB values
! Base !! RGB
|-
| A || 0
|-
| C || 64
|-
| G || 128
|-
| T || 255
|}

{| class="wikitable"
|+ Table 3. Translation of RGB values into DNA bases
! RGB !! Base
|-
| 0-63 || A
|-
| 64-127 || C
|-
| 128-191 || G
|-
| 192-255 || T
|}

This method allows the encoding of 64 colors and ensures that the encoding of all the most common colors such as red, green, blue, yellow, magenta, orange, grey, black, and white is possible.

===Results===
Each codon encodes one pixel, and the coordinates of the codon in the array will be the coordinates of the pixel in the resulting bitmap. The following example shows each step of the encoding process:

DNA sequence:

ATA TAA TAA TAA TTA AAT AAA TTT AAA ATA AAT TTT GAG TTT ATA AAT AAA TTT AAA ATA TAA TTA TTA TTA AAT

DNA sequence as two-dimensional array:

{| class="wikitable"
| ATA || TAA || TAA || TAA || TTA
|-
| AAT || AAA || TTT || AAA || ATA
|-
| AAT || TTT || GAG || TTT || ATA
|-
| AAT || AAA || TTT || AAA || ATA
|-
| TAA || TTA || TTA || TTA || AAT
|}

The array is created by taking the square root of the number of codons in the DNA sequence. The result is rounded up to give the width and rounded down to give the height of the image. The two numbers are multiplied, and if the result is less than the number of codons, the smaller number is increased by 1. This results in an array that is large enough for all codons and in some cases slightly larger. The extra space will be filled with white pixels in the resulting image.

{| class="wikitable"
|+ Table 4. After translation into RGB
| 0,255,0 || 255,0,0 || 255,0,0 || 255,0,0 || 255,255,0
|-
| 0,0,255 || 0,0,0 || 255,255,255 || 0,0,0 || 0,255,0
|-
| 0,0,255 || 255,255,255 || 127,0,127 || 255,255,255 || 0,255,0
|-
| 0,0,255 || 0,0,0 || 255,255,255 || 0,0,0 || 0,255,0
|-
| 255,0,0 || 255,255,0 || 255,255,0 || 255,255,0 || 0,0,255
|}

Fig. 1. Resulting image (enlarged by factor 16)

===Future research and conclusion===
The use of only 64 colors obviously leads to a loss of color information. Also, the current algorithm assumes that the width and height of an image are as similar as possible (a square, or approximately a square). For example, a 120x40 pixel image would be decoded as a 60x80 pixel image. A possible solution would be to encode the dimensions of the image as well. Our method is simpler and more storage-space efficient than the one described by Goldman et al. (2013), but as a tradeoff it can encode fewer colors. It is also more specialized toward images, while Goldman's approach is geared toward a variety of data types. Further research could lead to the development of algorithms to detect, extract, and decode images that have been hidden in DNA sequences. These methods could be used for forensic purposes. Similar algorithms have already been developed for text-based DNA steganalysis (Beck, Desoky, Rouchka, & Yampolskiy, 2014).

===References===
[1] Ailenberg, M., & Rotstein, O. (2009). An improved Huffman coding method for archiving text, images, and music characters in DNA. Biotechniques, 47(3), 747-754. doi: 10.2144/000113218

[2] Anam, B., Sakib, K., Hossain, A., & Dahal, K. (2010). Review on the Advancements of DNA Cryptography. Paper presented at the International Conference on Software, Knowledge, Information Management and Application, Paro, Bhutan.

[3] Arita, M. (2004). Comma-free design for DNA words. Communications of the ACM, 47(5), 99. doi: 10.1145/986213.986244

[4] Arita, M., & Ohashi, Y. (2004). Secret Signatures Inside Genomic DNA. Biotechnology Progress, 20(5).

[5] Beck, M. B., Desoky, A. H., Rouchka, E. C., & Yampolskiy, R. V. (2014). Decoding Methods for DNA Steganalysis. Paper presented at the 6th International Conference on Bioinformatics and Computational Biology (BICoB 2014), Las Vegas, Nevada, USA.

[6] Beck, M. B., Rouchka, E. C., & Yampolskiy, R. V. (2013). Finding Data in DNA: Computer Forensic Investigations of Living Organisms. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunication Engineering, 114(2013), 204-219.

[7] Brenner, S., Stretton, A. O., & Kaplan, S. (1965). Genetic Code: The 'Nonsense' Triplets for Chain Termination and their Suppression. Nature, 05 June 1965; 206(988), 994-998.

[8] Brenner, S., Williams, S. R., Vermaas, E. H., Storck, T., Moon, K., & McCollum, C. (2000). In vitro cloning of complex mixtures of DNA on microbeads: Physical separation of differentially expressed cDNAs. Proceedings of the National Academy of Sciences of the United States of America, 97(4), 1665-1670.

[9] Church, G. M., Gao, Y., & Kosuri, S. (2012). Next-Generation Digital Information Storage in DNA. Science, 28 September 2012, 337(6102), 1628. doi: 10.1126/science.293.5536.1763c

[10] Clelland, C. T., Risca, V., & Bancroft, C. (1999). Hiding messages in DNA microdots. Nature, 399(10), 533-534.

[11] Davis, J. (1996). Microvenus. Art Journal, 55(1), 70-74.

[12] Gibson, D. G., Glass, J. I., Lartigue, C., Noskov, V. N., Chuang, R. Y., Algire, M. A., . . . Venter, J. C. (2010). Creation of a bacterial cell controlled by a chemically synthesized genome. Science, 329(5987), 52-56. doi: 10.1126/science.1190719

[13] Goldman, N., Bertone, P., Chen, S., Dessimoz, C., LeProust, E. M., Sipos, B., & Birney, E. (2013). Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature, 494(7435), 77-80. doi: 10.1038/nature11875

[14] Heider, D., & Barnekow, A. (2007). DNA-based watermarks using the DNA-Crypt algorithm. BMC Bioinformatics, 8, 176. doi: 10.1186/1471-2105-8-176

[15] Heider, D., & Barnekow, A. (2008). DNA watermarks: a proof of concept. BMC Mol Biol, 9, 40. doi: 10.1186/1471-2199-9-40

[16] Hennings, M. R., & Kettelberger, D. M. (2004). United States of America Patent No.: U. S. P. T. Office.

[17] Huffman, D. A. (1952). A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE, 40(9), 1098-1101. doi: 10.1109/JRPROC.1952.273898

[18] Jiao, S.-H. (2009). Hiding data in DNA of living organisms. Natural Science, 01(03), 181-184. doi: 10.4236/ns.2009.13023

[19] Jiao, S.-H., & Goutte, R. (2008). Code for Encryption Hiding Data into Genomic DNA. Paper presented at the 9th International Conference on Signal Processing, Beijing, China.

[20] Martin, R. G., Matthaei, J. H., Jones, O. W., & Nirenberg, M. W. (1962). Ribonucleotide composition of the genetic code. Biochemical and Biophysical Research Communications, 6(6), 410-414.

[21] Smith, G. C., Fiddes, C. C., Hawkins, J. P., & Cox, J. P. L. (2003). Some possible codes for encrypting data in DNA. Biotechnology Letters, 25(14), 1125-1130.

[22] Wong, P. C., Wong, K.-K., & Foote, H. (2003). Organic Data Memory Using the DNA Approach. Communications of the ACM, 46(1), 95-98.

[23] Yachie, N., Sekiyama, K., Sugahara, J., Ohashi, Y., & Tomita, M. (2007). Alignment-Based Approach for Durable Data Storage into Living Organisms. Biotechnology Progress, 23(2), 501-505.

[24] Yokoo, H., & Oshima, T. (1978). Is Bacteriophage φX174 DNA a Message from Extraterrestrial Intelligence? Icarus, 38(1).
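
As a supplementary illustration, the following Python sketch shows one way the two coding tables and the square-root array rule from the Results section could be implemented. It is a minimal sketch under stated assumptions, not the authors' program: the function names (`value_to_base`, `pixel_to_codon`, `codon_to_pixel`, `grid_size`, `decode`, `encode`) are hypothetical, FASTA and bitmap file handling is left out, and the decoder follows Table 2 literally, so G is decoded as 128. The "two closest factors" rule mentioned in the Methodology is not implemented here.

```python
import math

# Table 2: DNA base -> 8-bit channel value (used when decoding).
BASE_TO_VALUE = {"A": 0, "C": 64, "G": 128, "T": 255}

# Padding colour for array cells that have no codon.
WHITE = (255, 255, 255)


def value_to_base(value):
    """Table 3: quantise one 0-255 channel value into one of four bases."""
    if not 0 <= value <= 255:
        raise ValueError("channel value must be in 0..255")
    return "ACGT"[value // 64]  # 0-63 A, 64-127 C, 128-191 G, 192-255 T


def pixel_to_codon(pixel):
    """Encode one (R, G, B) pixel as a three-base codon."""
    red, green, blue = pixel
    return value_to_base(red) + value_to_base(green) + value_to_base(blue)


def codon_to_pixel(codon):
    """Decode a three-base codon into an (R, G, B) tuple."""
    return tuple(BASE_TO_VALUE[base] for base in codon)


def grid_size(n_codons):
    """Square-root rule from the Results section; returns (width, height)."""
    root = math.isqrt(n_codons)          # square root rounded down
    height = root
    width = root if root * root == n_codons else root + 1  # rounded up
    if width * height < n_codons:
        height += 1                      # enlarge the smaller dimension
    return width, height


def decode(sequence):
    """Turn a string of bases into rows of RGB tuples, padded with white."""
    bases = "".join(sequence.split()).upper()
    if len(bases) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    codons = [bases[i:i + 3] for i in range(0, len(bases), 3)]
    width, height = grid_size(len(codons))
    pixels = [codon_to_pixel(codon) for codon in codons]
    pixels += [WHITE] * (width * height - len(pixels))
    return [pixels[row * width:(row + 1) * width] for row in range(height)]


def encode(rows):
    """Turn rows of RGB tuples into a space-separated codon string."""
    return " ".join(pixel_to_codon(pixel) for row in rows for pixel in row)
```

For the 25-codon example sequence from the Results section, `decode` returns a 5 by 5 grid whose first pixel is `(0, 255, 0)`, and `encode` applied to that grid reproduces the original codon string. Because padding cells are white, a padded image re-encodes with trailing `TTT` codons.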