=Paper=
{{Paper
|id=Vol-1353/paper_10
|storemode=property
|title=Hiding Color Images in DNA Sequences
|pdfUrl=https://ceur-ws.org/Vol-1353/paper_10.pdf
|volume=Vol-1353
|dblpUrl=https://dblp.org/rec/conf/maics/BeckY15
}}
==Hiding Color Images in DNA Sequences==
<pdf width="1500px">https://ceur-ws.org/Vol-1353/paper_10.pdf</pdf>
<pre>
                               Hiding Color Images in DNA Sequences

                                            Marc B. Beck*, Roman V. Yampolskiy

 Cybersecurity Lab, Department of Computer Engineering and Computer Science, Speed School of En-
                      gineering, University of Louisville, Louisville, KY 40292
                                               {mbbeck05,roman.yampolskiy}@Louisville.edu


                               Abstract                               Stretton, & Kaplan, 1965; Martin et al., 1962) encodes for
                                                                      one of the twenty amino acids. This means that multiple
With recent advances in genetic engineering it has become possi-      codons encode the same amino acid, allowing for degener-
ble to embed artificial DNA strands into the living cells of organ-   acy.
isms. With DNA having a great capability for data storage, many
                                                                         It has become possible through recent advances in genet-
methods have been developed to insert artificial information into
a DNA sequence. However, most of these methods focus on the
                                                                      ic engineering to insert artificial DNA sequences into the
encoding of text data and little research has been done regarding     living cells of organisms (Gibson et al., 2010). This allows
the encoding of other media. Few methods have been researched         the insertion of information into the DNA strands of organ-
to encode images and most of those are only for black-and-white       isms such as bacteria for applications like data storage, wa-
images. We are proposing an algorithm to insert and extract color     termarking, or communication of secret messages.
images in the form of bitmap files into a DNA sequence in the
form of a FASTA file. Results from our experiments show that
the proposed method is significantly more efficient than previous                   DNA as Storage Medium
approaches.
                                                                      Researchers are investigating DNA as an ultra-compact,
                                                                      long-term data storage medium (Church, Gao, & Kosuri,
                           Introduction                               2012; Goldman et al., 2013; Wong, Wong, & Foote, 2003)
                                                                      as well as a stegomedium for hiding information (Smith,
Deoxyribose Nucleic Acid (DNA), the molecule that car-
                                                                      Fiddes, Hawkins, & Cox, 2003). Messages are expressed
ries the hereditary information for every living organism, is
                                                                      as a series of As, Cs, Gs, and Ts in DNA code as opposed
a double helix with two anti-parallel strands. These strands          to ones and zeroes in binary code. In order to encode mes-
contain four different nucleotides which are distinguished            sages in DNA, researchers have developed various algo-
by the bases adenine (A), cytosine (C), guanine (G), and              rithms that can either insert a message into an existing
thiamine (T). Using combinations of those four nucleo-                DNA sequence (Jiao, 2009), or disguise it as a new one.
tides, DNA has the potential to store vast amounts of data
                                                                      These artificial DNA strands can be inserted into the ge-
within genomes, the length of which can range up to sever-
                                                                      nomes of living organisms, which has been proven possi-
al billion bases (Anam, Sakib, Hossain, & Dahal, 2010).               ble by Venter (Gibson et al., 2010) and others (Jiao, 2009),
   Certain regions within a genomic sequence translate into           (Arita, 2004; Brenner et al., 2000; Heider & Barnekow,
genes that produce proteins consisting of amino acids.                2008; Jiao & Goutte, 2008; Yachie, Sekiyama, Sugahara,
Marshall Nirenberg (Martin, Matthaei, Jones, & Nirenberg,             Ohashi, & Tomita, 2007)].
1962) first discovered the genetic code, which dictates how
                                                                         The first cell with a synthetic genome was created in
the combinations of the four bases of DNA are translated              2010 by Craig Venter, who led the private effort to se-
into twenty amino acids.                                              quence the human genome. Venter and his team at the J.
   A sequence of three nucleotides that determines which              Craig Venter Institute (JCVI) modified a computer file of
amino acid will be incorporated next during protein syn-              the DNA sequence of the bacterium Mycoplasma my-
thesis is called a codon. The four nucleotides can be com-            coides. They then produced physical DNA from this se-
bined into 43=64 unique codons. Each codon, except the
                                                                      quence, which they inserted into a cell. This cell repro-
three STOP codons TAA, TAG, and TGA (Brenner,                         duced under control of the artificial DNA to create a new
                                                                      bacterium [4].
Copyright held by the author(s).
  High density, redundancy, and a long life expectancy are             Insertion of Media other than Text
some of the many advantages of using DNA as a data stor-
age medium.                                                     Most research in DNA steganography focuses on hiding
                                                                text and only very little research has been done so far on
                                                                hiding other media in a DNA sequence. Goldman et al.
                 DNA Steganography                              (Goldman et al., 2013) describe encoding five files of vari-
                                                                ous types in a DNA sequence. These files include a JPEG
Steganography is the science of transmitting a message
                                                                2000 image and a speech in MP3 format. The coding
hidden inside a cover medium in a way that no one other
                                                                scheme they used utilizes several intermediate steps. First,
than the sender and the intended recipient suspect its exist-
                                                                the image file and the sound file are translated into binary.
ence. The goal of steganography is to avoid suspicion to
                                                                Then, the text file and the binary data from the other files
the existence of the message, while cryptography merely         is translated into a base-3 code and finally into sequences
aims at making a message unreadable (Wong et al., 2003).        of DNA bases.
Its properties as a data storage medium also make DNA a
good medium for steganography.                                  Davis (Davis, 1996) describes a method of encoding the
   Noncoding genomic regions may seem to be an obvious          black-and-white image of a relatively simple shape (5 by 7
choice of locations for inserting messages. However, the        bitmap) into a 35 bit binary sequence, which was then
biological purpose of these regions is not yet fully under-     compressed. His approach compares the molecular weights
stood [11] and tampering with them might possibly cause         of the bases to obtain an incremental reference. Starting
the death of the organism.                                      with the smallest base, Cytosine, Davis assigns numbers to
   Arita et al. (Arita, 2004) suggested to encode the mes-      the bases in ascending order. This results in C = 1, T =2, A
sage in the protein coding regions of genes instead. With       =3, and G =4. This method compresses the binary digits of
20 amino acids and one stop symbol using a total of 64          the bit-mapped image into fewer DNA base symbols by us-
possible codons (Arita & Ohashi, 2004), there is redundan-      ing each base to indicate how many times each binary val-
cy where often two or more codons code for the same ami-        ue (0 or 1) is to be repeated before changing to the respec-
no acid. This redundancy can be used to embed messages,         tive other value. This technique is widely used in data
since many of these redundant, or synonymous, codons            compression. This can be represented as shown in Table 1.
typically differ in their third position, also known as the     Using this coding method, the thirty-five-bit black-and-
wobble base (Brenner et al., 1965). This means that chang-      white image is translated to only eighteen DNA bases:
ing the wobble base to hide messages will not affect the        CCCCCCAACGCGCGCGCT
coded amino acid.
                                                                These can be decoded to yield one of the two following bi-
                                                                nary sequences:
                                                                10101011100010000100001000010000100
   Coding Schemes for Hiding Text in DNA
                                                                or
A code is a cryptographic rule that determines which sym-
bol from a target alphabet uniquely represents which sym-       01010100011101111011110111101111011
bol from a source alphabet. In DNA Steganography, the           This depends on if either a 1 or a 0 is chosen to start the
source alphabet is made of alphanumeric characters the          decoding sequence. Transforming either of the two se-
case of text information, or color values of pixels in the      quences into the correct five-by-seven matrix will produce
case of images. The target alphabet consists of various         the image. Since the example used by Davis is bilaterally
combinations of the initials of the four nucleotides.           symmetrical, more than one of several possible five-by
   Which symbol in the target alphabet is chosen to repre-      seven matrices will in this case result in producing the cor-
sent which symbol in the source alphabet is determined by       rect bitmap [22].
a set of rules called a coding scheme.
   We have already compared several existing coding
schemes in an earlier publication (Beck, Rouchka, &
Yampolskiy, 2013). Simple substitution ciphers are the
most common ones [7, (Clelland, Risca, & Bancroft,
1999)]. Other, more sophisticated coding schemes exist,
which are either geared toward error detection and correc-
tion [(Arita & Ohashi, 2004), (Heider & Barnekow, 2007)],
or optimization of the available capacity to hide messages
(Huffman, 1952), (Ailenberg & Rotstein, 2009).
       Table 1. Coding scheme used by Davis [22]                   Table 3. Translation of RGB values into DNA bases
                Base     Bit sequence                                                    RGB            Base
                  C        1 or 0                                                         0-63            A
                  T        11 or 00
                  A        111 or 000                                                    64-127           C
                  G        1111 or 0000                                                  128-191          G
                                                                                         192-255          T
   Ailenberg and Rotstein (Ailenberg & Rotstein, 2009)
have developed a coding scheme to encode an image that is
composed of shapes and their coordinates.
   This way of encoding an image is not very efficient. A        Results
more feasible approach has been described by Yokoo and           Each codon encodes one pixel and the coordinates of the
Oshima (Yokoo & Oshima, 1978). This approach suggests            codon in the array will be the coordinates of the pixel in
to arrange the 3-base codons of a DNA sequence in a two          the resulting bitmap. The following example shows each
dimensional array and then translate one base of each co-        step of the encoding process:
don into either black or white, with G and C being black
and A and T being white, or vice versa. This is done for         DNA sequence:
each base of all the codons, which would result in three         ATA TAA TAA TAA TTA AAT AAA TTT AAA ATA
separate images.                                                 AAT TTT GAG TTT ATA AAT AAA TTT AAA ATA
   Hennings and Kettelberger (Hennings & Kettelberger,           TAA TTA TTA TTA AAT
2004) have developed a method to generate music by de-
coding and transcribing genetic information within a DNA         DNA sequence as two dimensional array:
sequence into a music signal having melody and harmony.          ATA TAA TAA TAA TTA
                                                                 AAT AAA TTT AAA ATA
                                                                 AAT TTT GAG TTT ATA
                                                                 AAT AAA TTT AAA ATA
Methodology                                                      TAA TTA TTA TTA AAT
We have developed a very similar coding scheme to the
one described by Yokoo and Oshima [23], with the differ-            The array is created by taking the square root of the
ence that we use all three bases of each codon for encoding      number of codons in the DNA sequence. The result is
color information instead of creating three separate images.     rounded up to give the width and rounded down to give the
Our approach will determine the width and height of the          height of the image. The two numbers are multiplied and if
array used for creating the image using the two closest fac-     the result is less than the number of codons, the smaller
tors of the number of codons. This will result in a picture      number is increased by 1. This will result in an array that is
that is as close to a square in shape as possible.               large enough for all codons, in some cases slightly larger.
   The DNA sequence is arranged in a two dimensional ar-         The extra space will be filled with white pixels in the re-
ray the same way as described by Yokoo and Oshima                sulting image.
(Yokoo & Oshima, 1978), but in our case the first base of
                                                                            Table 4. After translation into RGB:
each codon is used to encode the red portion, the second
base for the green portion, and the third for the blue portion    0,255,0   255,0,0       255,0,0        255,0,0     255,255,0
of each pixel. DNA bases are translated into RGB values           0,0,255   0,0,0         255,255,255    0,0,0       0,255,0
using the following coding tables:                                0,0,255   255,255,25    127,0,127      255,255,2   0,255,0
                                                                            5                            55
   Table 2. Translation of DNA bases to RGB values                0,0,255   0,0,0         255,255,255    0,0,0       0,255,0
                                                                  255,0,0   255,255,0     255,255,0      255,255,0   0,0,255
                       Base     RGB
                         A         0
                         C        64
                         G       128
                         T       255
   This method allows the encoding of 64 colors and en-
sures that the encoding of all the most common colors such
as red, green, blue, yellow, magenta, orange, grey, black,
and white is possible.


     Fig. 1. Resulting image (enlarged by factor 16)


Future research and conclusion
The use of only 64 colors obviously leads to the loss of
color information. Also, with the current algorithm the
program assumes that the width and height of an image are
as similar (a square, or approximately a square) as possi-
ble. For example, a 120x40 pixel image would be decoded
as a 60x80 pixel image. A possible solution would be to
encode the dimensions of the image as well. Our method is
simpler and more storage space efficient than the one de-
scribed by Goldman [6], but as a tradeoff can encode fewer
colors. It is also more specialized toward images, while
Goldman’s approach is geared toward a variety of data
types. Further research could lead to the development of
algorithms to detect, extract and decode images that have
been hidden in DNA sequences. These methods could be
used for forensic purposes. Similar algorithms have already
been developed for text-based DNA Steganalysis (Beck,
Desoky, Rouchka, & Yampolskiy, 2014).
References                                                             [16] Hennings, M. R., & Kettelberger, D. M. (2004). United
[1] Ailenberg, M., & Rotstein, O. (2009). An improved Huffman                     States of America Patent No.: U. S. P. T. Office.
           coding method for archiving text, images, and music         [17] Huffman, D. A. (1952). A Method for the Construction of
           characters in DNA. Biotechniques, 47(3), 747-754. doi:                 Minimum-Redundancy Codes. Proceedings of the IRE,
           10.2144/000113218                                                      40(9), 1098 - 1101. doi:
[2] Anam, B., Sakib, K., Hossain, A., & Dahal, K. (2010).                         10.1109/JRPROC.1952.273898
           Review on the Advancements of DNA Cryptography.             [18] Jiao, S.-H. (2009). Hiding data in DNA of living organisms.
           Paper presented at the International conference on                     Natural Science, 01(03), 181-184. doi:
           Software, Knowledge, Information Management and                        10.4236/ns.2009.13023
           Application, Paro, Bhutan.                                  [19] Jiao, S.-H., & Goutte, R. (2008). Code for Encryption Hiding
[3] Arita, M. (2004). Comma-free design for DNA words.                            Data into Genomic DNA. Paper presented at the 9th
           Communications of the ACM, 47(5), 99. doi:                             International Conference on Signal Processing, Beijing,
           10.1145/986213.986244                                                  China.
[4] Arita, M., & Ohashi, Y. (2004). Secret Signatures Inside           [20] Martin, R. G., Matthaei, J. H., Jones, O. W., & Nirenberg,
           Genomic DNA. Biotechnology Progress, 20(5).                            M. W. (1962). Ribonucleotide composition of the
[5] Beck, M. B., Desoky, A. H., Rouchka, E. C., & Yampolskiy,                     genetic code. Biochemical and biophysical research
           R. V. (2014). Decoding Methods for DNA                                 communications, 6(6), 410-414.
           Steganalysis. Paper presented at the Paper presented at     [21] Smith, G. C., Fiddes, C. C., Hawkins, J. P., & Cox, J. P. L.
           the 6th International Conference on Bioinformatics and                 (2003). Some possible codes for encrypting data in
           Computational Biology (BICoB 2014), Las Vegas,                         DNA. Biotechnology Letters, 25(14), 1125-1130.
           Nevada, USA.                                                [22] Wong, P. C., Wong, K.-K., & Foote, H. (2003). ORGANIC
[6] Beck, M. B., Rouchka, E. C., & Yampolskiy, R. V. (2013).                      DATA MEMORY Using the DNA Approach.
           Finding Data in DNA: Computer Forensic                                 Communications of the ACM, 46(1), 95-98.
           Investigations of Living Organisms. Lecture Notes of        [23] Yachie, N., Sekiyama, K., Sugahara, J., Ohashi, Y., &
           the Institute for Computer Sciences, Social Informatics                Tomita, M. (2007). Alignment-Based Approach for
           and Telecommunication Engineering, 114(2013), 204-                     Durable Data Storage into Living Organisms.
           219.                                                                   Biotechnology Progress 2007 Mar-Ap, 23(2), 501-505.
[7] Brenner, S., Stretton, A. O., & Kaplan, S. (1965). Genetic         [24] Yokoo, H., & Oshima, T. (1978). Is Bacteriophage X174
           Code: The 'Nonsense' Triplets for Chain Termination                    DNA a Message from Extraterrestrial Intelligence?
           and their Suppression. Nature, 05 June 1965; 206(988),                 Icarus, 38(1).
           994-998.
[8] Brenner, S., Williams, S. R., Vermaas, E. H., Storck, T.,
           Moon, K., & McCollum, C. (2000). In vitro cloning of
           complex mixtures of DNA on microbeads: Physical
           separation of differentially expressed cDNAs.
           Proceedings of the National Academy of Sciences of
           the United States of America, 97(4), 1665-1670.
[9] Church, G. M., Gao, Y., & Kosuri, S. (2012). Next-
           Generation Digital Information Storage in DNA.
           Science 28 September 2012, 337(6102),1628.doi:
           10.1126/science.293.5536.1763c
[10] Clelland, C. T., Risca, V., & Bancroft, C. (1999). Hiding
           messages in DNA microdots.pdf. Nature, 399(10), 533-
           534.
[11] Davis, J. (1996). Microvenus. Art Journal, 55(1), 70-74.
[12] Gibson, D. G., Glass, J. I., Lartigue, C., Noskov, V. N.,
           Chuang, R. Y., Algire, M. A., . . . Venter, J. C. (2010).
           Creation of a bacterial cell controlled by a chemically
           synthesized genome. Science, 329(5987), 52-56. doi:
           10.1126/science.1190719
[13] Goldman, N., Bertone, P., Chen, S., Dessimoz, C., LeProust,
           E. M., Sipos, B., & Birney, E. (2013). Towards
           practical, high-capacity, low-maintenance information
           storage in synthesized DNA. Nature, 494(7435), 77-80.
           doi: 10.1038/nature11875
[14] Heider, D., & Barnekow, A. (2007). DNA-based watermarks
           using the DNA-Crypt algorithm. BMC Bioinformatics,
           8, 176. doi: 10.1186/1471-2105-8-176
[15] Heider, D., & Barnekow, A. (2008). DNA watermarks: a
           proof of concept. BMC Mol Biol, 9, 40. doi:
           10.1186/1471-2199-9-40

</pre>