<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>DNA watermarks: a
proof of concept. BMC Mol Biol</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.2144/000113218</article-id>
      <title-group>
        <article-title>Hiding Color Images in DNA Sequences</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marc B. Beck</string-name>
          <email>mbbeck05@Louisville.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roman V. Yampolskiy</string-name>
          <email>roman.yampolskiy@Louisville.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cybersecurity Lab, Department of Computer Engineering and Computer Science, Speed School of Engineering, University of Louisville</institution>
          ,
          <addr-line>Louisville, KY 40292</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <volume>9</volume>
      <issue>40</issue>
      <fpage>747</fpage>
      <lpage>754</lpage>
      <abstract>
        <p>Recent advances in genetic engineering have made it possible to embed artificial DNA strands into the living cells of organisms. Because DNA has a great capacity for data storage, many methods have been developed to insert artificial information into a DNA sequence. However, most of these methods focus on encoding text, and little research has addressed other media. Few methods exist for encoding images, and most of those handle only black-and-white images. We propose an algorithm to insert color images, in the form of bitmap files, into a DNA sequence stored as a FASTA file, and to extract them again. Results from our experiments show that the proposed method is significantly more efficient than previous approaches.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Deoxyribonucleic acid (DNA), the molecule that
carries the hereditary information of every living organism, is
a double helix with two anti-parallel strands. These strands
contain four different nucleotides, which are distinguished
by the bases adenine (A), cytosine (C), guanine (G), and
thymine (T). Using combinations of those four
nucleotides, DNA has the potential to store vast amounts of data
within genomes, the length of which can range up to
several billion bases (Anam, Sakib, Hossain, &amp; Dahal, 2010).</p>
      <p>
        Certain regions within a genomic sequence are genes, which
are expressed as proteins consisting of amino acids.
Marshall Nirenberg
        <xref ref-type="bibr" rid="ref5">(Martin, Matthaei, Jones, &amp; Nirenberg,
1962)</xref>
        first discovered the genetic code, which dictates how
the combinations of the four bases of DNA are translated
into twenty amino acids.
      </p>
      <p>Copyright held by the author(s).</p>
      <p>A sequence of three nucleotides that determines which
amino acid will be incorporated next during protein
synthesis is called a codon. The four nucleotides can be
combined into 4<sup>3</sup> = 64 unique codons. Each codon, except the
three STOP codons TAA, TAG, and TGA (Brenner,
Stretton, &amp; Kaplan, 1965; Martin et al., 1962), encodes
one of the twenty amino acids. Because 61 codons map to only
twenty amino acids, several codons encode the same amino acid,
a property known as degeneracy.</p>
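      <p>As a quick check of the codon count, the following Python sketch enumerates every three-base combination and sets aside the three STOP codons named in the text. The variable names are illustrative and not part of any published tool.</p>
      <preformat>
```python
from itertools import product

BASES = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Every ordered triple of bases is one codon: 4 * 4 * 4 = 64.
codons = ["".join(triple) for triple in product(BASES, repeat=3)]
coding_codons = [c for c in codons if c not in STOP_CODONS]

print(len(codons))         # 64
print(len(coding_codons))  # 61
```
      </preformat>
      <p>With 61 coding codons and only twenty amino acids, at least some amino acids must be served by more than one codon.</p>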
      <p>It has become possible through recent advances in
genetic engineering to insert artificial DNA sequences into the
living cells of organisms (Gibson et al., 2010). This allows
the insertion of information into the DNA strands of
organisms such as bacteria for applications like data storage,
watermarking, or communication of secret messages.</p>
    </sec>
    <sec id="sec-2">
      <title>DNA as Storage Medium</title>
      <p>
        Researchers are investigating DNA as an ultra-compact,
long-term data storage medium
        <xref ref-type="bibr" rid="ref7">(Church, Gao, &amp; Kosuri,
2012; Goldman et al., 2013; Wong, Wong, &amp; Foote, 2003)</xref>
        as well as a stegomedium for hiding information
        <xref ref-type="bibr" rid="ref6">(Smith,
Fiddes, Hawkins, &amp; Cox, 2003)</xref>
        . Messages are expressed
as a series of As, Cs, Gs, and Ts in DNA code as opposed
to ones and zeroes in binary code. In order to encode
messages in DNA, researchers have developed various
algorithms that can either insert a message into an existing
DNA sequence
        <xref ref-type="bibr" rid="ref3">(Jiao, 2009)</xref>
        , or disguise it as a new one.
These artificial DNA strands can be inserted into the
genomes of living organisms, as demonstrated
by Venter (Gibson et al., 2010) and others
        <xref ref-type="bibr" rid="ref3">(Jiao, 2009)</xref>
        ,
        <xref ref-type="bibr" rid="ref4 ref8">(Arita, 2004; Brenner et al., 2000; Heider &amp; Barnekow,
2008; Jiao &amp; Goutte, 2008; Yachie, Sekiyama, Sugahara,
Ohashi, &amp; Tomita, 2007)</xref>
        .
      </p>
      <p>The first cell with a synthetic genome was created in
2010 by Craig Venter, who led the private effort to
sequence the human genome. Venter and his team at the J.
Craig Venter Institute (JCVI) modified a computer file of
the DNA sequence of the bacterium Mycoplasma
mycoides. They then produced physical DNA from this
sequence, which they inserted into a cell. This cell
reproduced under control of the artificial DNA to create a new
bacterium (Gibson et al., 2010).</p>
      <p>High density, redundancy, and a long life expectancy are
some of the many advantages of using DNA as a data
storage medium.</p>
    </sec>
    <sec id="sec-3">
      <title>DNA Steganography</title>
      <p>
        Steganography is the science of transmitting a message
hidden inside a cover medium in such a way that no one other
than the sender and the intended recipient suspects its
existence. The goal of steganography is to avoid arousing suspicion about
the existence of the message, while cryptography merely
aims at making a message unreadable
        <xref ref-type="bibr" rid="ref7">(Wong et al., 2003)</xref>
        .
The properties that make DNA a good data storage medium also make it a
good medium for steganography.
      </p>
      <p>Noncoding genomic regions may seem to be an obvious
choice of location for inserting messages. However, the
biological purpose of these regions is not yet fully
understood [11], and tampering with them might cause
the death of the organism.</p>
      <p>Arita (Arita, 2004) suggested encoding the
message in the protein-coding regions of genes instead. With
20 amino acids and one stop signal sharing a total of 64
possible codons (Arita &amp; Ohashi, 2004), there is
redundancy: often two or more codons code for the same
amino acid. This redundancy can be used to embed messages,
since many of these redundant, or synonymous, codons
differ only in their third position, also known as the
wobble base (Brenner et al., 1965). Changing the wobble base
of such a codon to hide a message therefore does not affect the
encoded amino acid.</p>
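      <p>To illustrate the wobble-base idea, here is a minimal Python sketch that hides two bits in the third base of each codon belonging to a family in which all four third bases yield the same amino acid under the standard genetic code. It is not the scheme published by Arita; the bit-to-base mapping and the function names <monospace>embed</monospace> and <monospace>extract</monospace> are assumptions made for this example.</p>
      <preformat>
```python
# Codon prefixes whose third base can be any of A, C, G, T without changing
# the amino acid (standard genetic code, fourfold-degenerate families).
FOURFOLD = {"GC": "Ala", "GG": "Gly", "GT": "Val", "CC": "Pro",
            "AC": "Thr", "TC": "Ser", "CT": "Leu", "CG": "Arg"}
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}  # assumed mapping
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def embed(codons, bits):
    """Write two message bits into the wobble base of each usable codon."""
    out, pos = [], 0
    for codon in codons:
        if codon[:2] in FOURFOLD and len(bits) - pos >= 2:
            codon = codon[:2] + BITS_TO_BASE[bits[pos:pos + 2]]
            pos += 2
        out.append(codon)
    return out

def extract(codons, n_bits):
    """Read the wobble bases of all usable codons and keep the first n_bits."""
    bits = "".join(BASE_TO_BITS[c[2]] for c in codons if c[:2] in FOURFOLD)
    return bits[:n_bits]

gene = ["ATG", "GCA", "GGA", "AAA", "GTA", "CCA", "TAA"]
marked = embed(gene, "01101100")
print(marked)              # ['ATG', 'GCC', 'GGG', 'AAA', 'GTT', 'CCA', 'TAA']
print(extract(marked, 8))  # 01101100
```
      </preformat>
      <p>Only the third base of the marked codons changes, so the first two bases, and with them the encoded amino acids, stay the same. A practical scheme would also have to consider codon usage bias, which this sketch ignores.</p>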
    </sec>
    <sec id="sec-4">
      <title>Coding Schemes for Hiding Text in DNA</title>
      <p>A code is a cryptographic rule that determines which
symbol from a target alphabet uniquely represents which
symbol from a source alphabet. In DNA steganography, the
source alphabet is made up of alphanumeric characters in the
case of text information, or color values of pixels in the
case of images. The target alphabet consists of various
combinations of the initials of the four nucleotides.</p>
      <p>Which symbol in the target alphabet is chosen to
represent which symbol in the source alphabet is determined by
a set of rules called a coding scheme.</p>
      <p>
        We compared several existing coding
schemes in an earlier publication (Beck, Rouchka, &amp;
Yampolskiy, 2013). Simple substitution ciphers are the
most common ones ([7]; Clelland, Risca, &amp; Bancroft,
1999). Other, more sophisticated coding schemes
are geared either toward error detection and
correction (Arita &amp; Ohashi, 2004; Heider &amp; Barnekow, 2007)
or toward optimizing the capacity available for hiding messages
        <xref ref-type="bibr" rid="ref2">(Huffman, 1952)</xref>
        , (Ailenberg &amp; Rotstein, 2009).
      </p>
    </sec>
    <sec id="sec-5">
      <title>Insertion of Media other than Text</title>
      <p>Most research in DNA steganography focuses on hiding
text, and very little research has been done so far on
hiding other media in a DNA sequence. Goldman et al.
(Goldman et al., 2013) describe encoding five files of
various types in a DNA sequence. These files include a JPEG
2000 image and a speech in MP3 format. Their coding
scheme uses several intermediate steps. First,
the image file and the sound file are translated into binary.
Then, the text file and the binary data from the other files
are translated into a base-3 code and finally into sequences
of DNA bases.</p>
      <p>Davis (Davis, 1996) describes a method of encoding the
black-and-white image of a relatively simple shape (a 5 by 7
bitmap) as a 35-bit binary sequence, which was then
compressed. His approach ranks the bases by molecular weight
to obtain an incremental reference. Starting
with the lightest base, cytosine, Davis assigns numbers to
the bases in ascending order. This results in C = 1, T = 2,
A = 3, and G = 4. The method compresses the binary digits of
the bit-mapped image into fewer DNA base symbols by
using each base to indicate how many times the current binary
value (0 or 1) is repeated before switching to the
other value. This technique, run-length encoding, is widely used in data
compression. This can be represented as shown in Table 1.
Using this coding method, the thirty-five-bit
black-and-white image is translated into only eighteen DNA bases:</p>
      <p>CCCCCCAACGCGCGCGCT</p>
      <p>These can be decoded to yield one of the following two
binary sequences:</p>
      <p>10101011100010000100001000010000100</p>
      <p>or</p>
      <p>01010100011101111011110111101111011</p>
      <p>Which sequence results depends on whether a 1 or a 0 is chosen to start the
decoding. Arranging either of the two
sequences in the correct five-by-seven matrix will produce
the image. Since the example used by Davis is bilaterally
symmetrical, more than one of the possible five-by-seven
matrices will in this case produce the
correct bitmap (Davis, 1996).</p>
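      <p>The decoding step can be sketched in a few lines of Python. The base values are the ones given in the text; the function name <monospace>decode_runs</monospace> and its <monospace>first_bit</monospace> parameter are illustrative choices, not part of the method as Davis published it.</p>
      <preformat>
```python
RUN_LENGTH = {"C": 1, "T": 2, "A": 3, "G": 4}  # base values from the text

def decode_runs(dna, first_bit):
    """Expand each base into a run of bits, flipping the bit after each run."""
    bit, out = first_bit, []
    for base in dna:
        out.append(bit * RUN_LENGTH[base])
        bit = "0" if bit == "1" else "1"
    return "".join(out)

dna = "CCCCCCAACGCGCGCGCT"
print(decode_runs(dna, "1"))  # 10101011100010000100001000010000100
print(decode_runs(dna, "0"))  # 01010100011101111011110111101111011
```
      </preformat>
      <p>Both outputs are 35 bits long and are bitwise complements of each other, which is why the starting value has to be agreed on or inferred from the resulting picture.</p>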
      <p>Ailenberg and Rotstein (Ailenberg &amp; Rotstein, 2009)
have developed a coding scheme to encode an image that is
composed of shapes and their coordinates.</p>
      <p>
        This way of encoding an image is not very efficient. A
more feasible approach has been described by Yokoo and
Oshima
        <xref ref-type="bibr" rid="ref9">(Yokoo &amp; Oshima, 1978)</xref>
        . They suggest
arranging the 3-base codons of a DNA sequence in a
two-dimensional array and then translating one base of each
codon into either black or white, with G and C being black
and A and T being white, or vice versa. Repeating this for
each of the three base positions of the codons results in three
separate images.
      </p>
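      <p>A rough Python sketch of this per-position rendering follows. The function name <monospace>bitplanes</monospace>, the fixed <monospace>width</monospace> argument, and the use of <monospace>#</monospace> and <monospace>.</monospace> for black and white are assumptions made for illustration; the short codon list is made-up input.</p>
      <preformat>
```python
def bitplanes(codons, width, black="GC"):
    """Return three images (one per codon position) as rows of '#' and '.'."""
    images = []
    for pos in range(3):
        pixels = "".join("#" if c[pos] in black else "." for c in codons)
        rows = [pixels[i:i + width] for i in range(0, len(pixels), width)]
        images.append(rows)
    return images

codons = ["GAT", "ACA", "GAT", "ACA", "TGA", "ACA"]  # made-up example input
for image in bitplanes(codons, width=3):
    print("\n".join(image), end="\n\n")
```
      </preformat>
      <p>Passing <monospace>black="AT"</monospace> gives the inverted rendering mentioned in the text.</p>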
      <p>
        Hennings and Kettelberger
        <xref ref-type="bibr" rid="ref1">(Hennings &amp; Kettelberger,
2004)</xref>
        have developed a method to generate music by
decoding and transcribing genetic information within a DNA
sequence into a music signal having melody and harmony.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Methodology</title>
      <p>We have developed a coding scheme very similar to the
one described by Yokoo and Oshima (Yokoo &amp; Oshima, 1978), with the
difference that we use all three bases of each codon to encode
color information instead of creating three separate images.
Our approach determines the width and height of the
array used for creating the image from the two closest
factors of the number of codons. This results in a picture
that is as close to a square in shape as possible.</p>
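      <p>One way to find the two closest factors is to search downward from the integer square root. The Python sketch below is an assumed implementation, and <monospace>closest_factors</monospace> is a hypothetical name; the text does not say how the factors are actually computed.</p>
      <preformat>
```python
import math

def closest_factors(n):
    """Return (a, b) with a * b == n, a no larger than b, and b - a minimal."""
    for a in range(math.isqrt(n), 0, -1):
        if n % a == 0:
            return a, n // a

print(closest_factors(25))  # (5, 5)
print(closest_factors(24))  # (4, 6)
print(closest_factors(7))   # (1, 7)
```
      </preformat>
      <p>For a prime number of codons the only factor pair is 1 and the number itself, so an exact factorization would yield a single row of pixels.</p>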
      <p>
        The DNA sequence is arranged in a two-dimensional
array in the same way as described by Yokoo and Oshima
        <xref ref-type="bibr" rid="ref9">(Yokoo &amp; Oshima, 1978)</xref>
        , but in our case the first base of
each codon encodes the red component, the second
base the green component, and the third the blue component
of each pixel. DNA bases are translated into RGB values
using the following coding tables:
      </p>
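      <p>The coding tables themselves are not reproduced in this text, so the Python sketch below uses a purely hypothetical table with four evenly spaced intensity levels per channel. It only illustrates the principle of one base per color channel; the actual values used by the method may differ.</p>
      <preformat>
```python
# HYPOTHETICAL coding table: four evenly spaced intensity levels per channel.
# The paper's own tables are not reproduced in this text.
LEVEL = {"A": 0, "C": 85, "G": 170, "T": 255}

def codon_to_rgb(codon):
    """First base gives red, second gives green, third gives blue."""
    r, g, b = (LEVEL[base] for base in codon)
    return r, g, b

def rgb_to_codon(rgb):
    """Quantize each 8-bit channel to the nearest of the four levels."""
    return "".join(
        min(LEVEL, key=lambda base: abs(LEVEL[base] - value)) for value in rgb
    )

print(codon_to_rgb("TAA"))           # (255, 0, 0)
print(rgb_to_codon((250, 130, 10)))  # TGA
```
      </preformat>
      <p>With four levels on each of three channels, any table of this shape yields 4 x 4 x 4 = 64 distinct colors, and quantizing an arbitrary 24-bit color to the nearest level is where color information is lost.</p>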
    </sec>
    <sec id="sec-7">
      <title>Results</title>
      <p>Each codon encodes one pixel and the coordinates of the
codon in the array will be the coordinates of the pixel in
the resulting bitmap. The following example shows each
step of the encoding process:</p>
      <sec id="sec-7-1">
        <title>DNA sequence</title>
        <p>ATA TAA TAA TAA TTA AAT AAA TTT AAA ATA AAT TTT GAG TTT ATA AAT AAA TTT AAA ATA TAA TTA TTA TTA AAT</p>
      </sec>
      <sec id="sec-7-2">
        <title>DNA sequence as two-dimensional array</title>
        <p>ATA TAA TAA TAA TTA
AAT AAA TTT AAA ATA
AAT TTT GAG TTT ATA
AAT AAA TTT AAA ATA
TAA TTA TTA TTA AAT</p>
        <p>The array dimensions are obtained by taking the square root of the
number of codons in the DNA sequence. The result is
rounded up to give the width and rounded down to give the
height of the image. The two numbers are multiplied, and if
the product is less than the number of codons, the smaller
number is increased by 1. This yields an array that is
large enough for all codons and in some cases slightly larger.
Any extra space is filled with white pixels in the
resulting image.</p>
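        <p>The sizing rule can be sketched in Python as follows. The function names <monospace>array_shape</monospace> and <monospace>to_rows</monospace> and the <monospace>---</monospace> padding marker are hypothetical; the marker stands in for cells that would be rendered as white pixels. The input is the example sequence shown earlier in this section.</p>
        <preformat>
```python
import math

def array_shape(n_codons):
    """Width and height from the rounded square root of the codon count."""
    height = math.isqrt(n_codons)  # square root, rounded down
    # Square root rounded up: same as height only for perfect squares.
    width = height if height * height == n_codons else height + 1
    if n_codons > width * height:
        height += 1  # grow the smaller side by one
    return width, height

def to_rows(dna, pad="---"):
    """Split a DNA string into codons and lay them out row by row.

    `pad` is a hypothetical placeholder for cells that become white pixels.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    width, height = array_shape(len(codons))
    codons += [pad] * (width * height - len(codons))
    return [codons[r * width:(r + 1) * width] for r in range(height)]

dna = ("ATATAATAATAATTA" "AATAAATTTAAAATA" "AATTTTGAGTTTATA"
       "AATAAATTTAAAATA" "TAATTATTATTAAAT")
for row in to_rows(dna):
    print(" ".join(row))
```
        </preformat>
        <p>For the 25 codons of the example this prints the same five-by-five arrangement as the array above. A sequence of 26 codons would give a 6 by 5 array with four padded cells.</p>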
        <p>This method allows the encoding of 64 colors and
ensures that the encoding of all the most common colors such
as red, green, blue, yellow, magenta, orange, grey, black,
and white is possible.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Future research and conclusion</title>
      <p>The use of only 64 colors obviously leads to a loss of
color information. Also, the current algorithm
assumes that the width and height of an image are
as similar as possible (a square, or approximately a square).
For example, a 120x40 pixel image would be decoded
as a 60x80 pixel image. A possible solution would be to
encode the dimensions of the image as well. Our method is
simpler and more storage-efficient than the one
described by Goldman (Goldman et al., 2013), but as a tradeoff it can encode fewer
colors. It is also more specialized toward images, while
Goldman’s approach is geared toward a variety of data
types. Further research could lead to the development of
algorithms to detect, extract, and decode images that have
been hidden in DNA sequences. These methods could be
used for forensic purposes. Similar algorithms have already
been developed for text-based DNA steganalysis (Beck,
Desoky, Rouchka, &amp; Yampolskiy, 2014).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Hennings</surname>
            ,
            <given-names>M. R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kettelberger</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          (
          <year>2004</year>
          ). United States of America Patent No.: U. S. P. T. Office.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Huffman</surname>
            ,
            <given-names>D. A.</given-names>
          </string-name>
          (
          <year>1952</year>
          ).
          <article-title>A Method for the Construction of Minimum-Redundancy Codes</article-title>
          .
          <source>Proceedings of the IRE</source>
          ,
          <volume>40</volume>
          (
          <issue>9</issue>
          ),
          <fpage>1098</fpage>
          -
          <lpage>1101</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/JRPROC.1952.273898</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Jiao</surname>
            ,
            <given-names>S.-H.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Hiding data in DNA of living organisms</article-title>
          .
          <source>Natural Science</source>
          ,
          <volume>01</volume>
          (
          <issue>03</issue>
          ),
          <fpage>181</fpage>
          -
          <lpage>184</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.4236/ns.2009.13023</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Jiao</surname>
            ,
            <given-names>S.-H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Goutte</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Code for Encryption Hiding Data into Genomic DNA</article-title>
          .
          <source>Paper presented at the 9th International Conference on Signal Processing</source>
          , Beijing, China.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>R. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matthaei</surname>
            ,
            <given-names>J. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>O. W.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Nirenberg</surname>
            ,
            <given-names>M. W.</given-names>
          </string-name>
          (
          <year>1962</year>
          ).
          <article-title>Ribonucleotide composition of the genetic code</article-title>
          .
          <source>Biochemical and biophysical research communications</source>
          ,
          <volume>6</volume>
          (
          <issue>6</issue>
          ),
          <fpage>410</fpage>
          -
          <lpage>414</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>G. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fiddes</surname>
            ,
            <given-names>C. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hawkins</surname>
            ,
            <given-names>J. P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Cox</surname>
            ,
            <given-names>J. P. L.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Some possible codes for encrypting data in DNA</article-title>
          .
          <source>Biotechnology Letters</source>
          ,
          <volume>25</volume>
          (
          <issue>14</issue>
          ),
          <fpage>1125</fpage>
          -
          <lpage>1130</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>P. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>K.-K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Foote</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>ORGANIC DATA MEMORY Using the DNA Approach</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>46</volume>
          (
          <issue>1</issue>
          ),
          <fpage>95</fpage>
          -
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Yachie</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sekiyama</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sugahara</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ohashi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Tomita</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Alignment-Based Approach for Durable Data Storage into Living Organisms</article-title>
          .
          <source>Biotechnology Progress</source>
          ,
          <volume>23</volume>
          (
          <issue>2</issue>
          ),
          <fpage>501</fpage>
          -
          <lpage>505</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Yokoo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Oshima</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>1978</year>
          ).
          <article-title>Is Bacteriophage φX174 DNA a Message from Extraterrestrial Intelligence?</article-title>
          <source>Icarus</source>,
          <volume>38</volume>
          (
          <issue>1</issue>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>