A System for Compression of Sequencing Data Boryana Pulova-Mihaylova, Iliyan Mihaylov, Irena Avdjieva, and Dimitar Vassilev FMI, University of Sofia St. Kliment Ohridski, 5 James Bourchier Blvd., Sofia 1164, Bulgaria dimitar.vassilev@fmi.uni-sofia.bg Abstract. The development of genomic sequencing technologies is directly related to the storage, analysis, visualization of huge amount of sequencing data. These data, in terms of quantity and quality, are a huge challenge for modern computer science and bioinformatics related to compression and decompression of huge data sets. The present study is devoted to the development of models and their implementation for compression of sequencing data. The main aim of the work is to develop a web-based system for sequencing data compression that provides opportunities for faster and more accurate compression and decompression, noise protection and error correction. For the purposes of this study, we developed models for compression and decompression based on the methods of literal coding. The developed Optimized algorithm is related to the well-known methods of Huffman and Shanon-Fano. Also are developed a web-based module (with user interface), a library of noise protection algorithms for the compression and decompression of omics data, and a server component to make the library accessible on the Internet. The implemented web module with user interface provides easy access to server methods for compression and decompression of sequences. Used data formats allows the module to be integrated with open access systems such as NCBI, UniProt, Ensembl and other well-known external resources. Keywords: Sequencing data, Compression, Noise protection, Web-based system. 1 Introduction The introduction of high-throughput sequencing technologies is accompanied by generation of large amounts of biological data. The trend of increasing sets of biological sequence data is due to the following main reasons:  The declining cost of genome sequencing provides conditions for even larger projects related to increased requirements for analysis of such data.  Increasing knowledge about the relationship between genotype and phenotype creates enormous opportunities for more specific in-depth research of genome, especially in the context of personalized medicine. Sequencing platforms are advancing both in length of sequencing reads and in speed of data generation [1]. Copyright © 2020 for this paper by 223its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Typically, storing and managing DNA data, or sequential data, is by encoded ASCII characters, which results in one byte for each base. Although this storing approach is wasteful in terms of space, since no compression is applied, many experts in the analysis of such data are accustomed to this data format. Its advantages are that it allows software applications and programming languages easily to access data, and also it is easy readable by humans. Of course, the big drawback of storing, analyzing and transferring data in this format is the need for huge and expensive resources of computer memory. The situation becomes much worse when not only ready-made sequenced genomes are stored, but also raw data of a certain quality. Presently, storing 10,000 genomes is a work in progress, but the rapid pace of sequencing, as well as new third-generation sequencing technologies, pose major challenges to the data being generated. [2] Most approaches for sequencing data compression are based on two of the most popular algorithms in recent years - Shannon-Fano and Huffman. These algorithms are compared to an optimized new version, developed in our study, which we will hereafter refer to as the “Optimized Algorithm”, based on the theory of literal codes, but with an improvement in code uniformity and compression / decompression speed. Huffman and Shannon-Fano’s algorithms result in compression as a result of non-uniform code, the main disadvantage of which is that it is prone to loss of information (in data transfer) and hence incorrect data recovery. With the Optimized Algorithm, a uniform code is obtained, which would easily identify the data loss and prevent the wrong data recovery. In addition, for large volumes of data (long DNA / RNA sequences) compression by a uniform code produces better results than a non-uniform compression. 2 Related work There are several basic types of compression approaches according to the data processing method. They have some features in common but they are quite different both methodologically and in scope [3]. The main methods for compression of nucleotide sequencing data are based on statistics and entropy - by exploiting compression algorithms using repetition detection and consequence statistics [4]. The increasing number of re-sequenced genomes has led to many new suggestions for compression algorithms. In general, compression algorithms can be divided into bitwise encoding, vocabulary-based compression, statistical and referential approaches. Bitwise encoding - a compression algorithm that can process input strings of any format. For the first common DNA characters (A, C, G, T as well as N (unknown)), seven bit encoding is used for three consecutive characters. Any other non-standard character is encoded with seven bits per character. Up to 128 224 additional characters can be encoded in this way. The free eight bit distinguishes whether a byte encodes three of the first main characters or some other, specific character [3]. Dictionary-based compression - performs with multiple iterations over the input data. With each pass, it detects longer sequences and gradually enriches the dictionary. The algorithm runs until the dictionary is changed or the frequency threshold is reached. The created dictionary is used to encode DNA sequences and is partially stored together with compressed data. This encoding allows random access to the compressed DNA sequences [5]. Another approach is so-called tree-based compression [6]. Splay trees are self-tuning binary search trees in which items that are last accessed are accessed faster. This property can improve the performance when you encounter local frequency changes. The methodology refers to the process of rearranging the trees as a specific element is placed at the root of the tree. Like evolutionary trees, symbols close to the root have shorter codes than symbols in other nodes. Another problem is the unequal distribution of DNA across the genome. One possible solution is to split the input of blocks and encode each block individually using different Markov models [7]. In [8], a “gene-wise compressor” is proposed as a mean for encoding non-repetitive portions of DNA sequences. In the initial implementation, the probabilities of the characters are evaluated and the corresponding Huffman coding is selected. The result is divided into blocks. Finally, each block is restructured in a way that allows efficient coding of the execution length to be used in the final step. Another approach is the input sequence is fragmented into blocks that do not overlap. For each block, a set of agents competes for coding: a Markov model, a two-bit base representation, and an approximate repetition that identifies repetitions interrupted by only a few SNPs (single nucleotide polymorphisms). The model that broadcasts the shortest code length is selected and the compressed output is subjected to additional arithmetic compression. Unfortunately, this approach can handle only four main characters in the input sequences. A method is proposed where non-repetitive regions are encoded as well as discrepancies in repetitive regions or mutations like SNPs with arithmetical coder, based on a Markov model. The resulting bit stream is split into blocks to allow random access. Mixed approach for compression: there are many variations and different views and discussions in the literature. It is suggested that to store the differences between the input compression sequence and the reference sequence [9]. They look at three types of single based features: insert, delete, and replace. The main contribution of their work is to analyze the process of encoding integers for absolute and relative reference positions in the sequence. However, the authors emphasize that the choice of reference sequence has a greater impact on the 225 compression ratio than the actual overall coding scheme. Similarly, other sources present a compression reference algorithm that only accepts SNPs and many basic INDELs between inputs and reference sequences. Each compression record consists of a reference position and additional data such as coincidence length or raw sequence. Table 1. Compression algorithms comparison Compression method Compression Bitwise encoding 2:1 – 6:1 Dictionary-based compression 4:1 – 6:1 Statistical model 4:1 – 8:1 Mixed model 1:1 – 400:1 Once the sequence is compressed, it must be sent over the network. The algorithms used for compression, namely Huffman, Shannon-Fano, and the optimized algorithm, which belongs to the class of literal coding, are very susceptible to errors. If even one bit is confused when transmitting network information or reading to and from the recorder, restoring the correct sequence is not possible. For this purpose, it is necessary to use error correction algorithms. Literature sources cite noise protection algorithms as one of the most widely used for this purpose [10]. They are applicable in all places where it is necessary to correct errors at the lowest possible level (bitwise). In damping waves, artificial intelligence models are also widely used in pattern recognition and error correction in generated images. One of the most common algorithms is that of Reed Mahler, suitable for data transmission in places where the connection is poor or very weak [10]. The main goals of the work are: to present a new method for compression of nucleotide sequencing data with improved performance, accuracy and noise protection based on the known methods of Huffman, Shannon-Fano and Reed- Miller, as well as to develop an interoperable and reusable software implementation of the elaborated method. 3 Suggested methodology The suggested compression approach in this study is based on coding and more specifically on the Shannon-Fano and Huffman algorithms. The main idea behind the Shannon-Fano and Huffman algorithms is to set a binary code to match each of the characters in the input message (in our case bases in the input sequence). The algorithm has the following characteristic properties:  The length of the codes is variable (the code is uneven). Here is our first modification of the algorithm, where the length of the codes is always the same i.e. the compressed message has a uniform code. The sequences being 226 considered for compression are of the same length, and this comes from the most used formats such as FASTA and FASTQ, where the length of the sequence is always specified, which could be used in advance to build a classification tree. Another feature of the code length is that when the individual characters in the message are replaced with their corresponding binary values, a compressed message of different sizes may be received. This is because for each symbol, a different binary value is given, which in some cases can be 1 and in others, it can be a combination of 0 and 1. For example, for “A” a value of 11 can be calculated, for “T” - a value equal to 101, which causes also change in the string size. This can be improved when using DNA sequence data because they have a small number of characters that can be encoded with equal length.  The letters with higher probability are more likely to be encoded in fewer bits than those with less likely ones. In addition, this is the next modification of the algorithm we suggest: an approach how to calculate the probability of occurrence of individual characters in the message using frequency analysis, which is the most time consuming process in this algorithm. In the case of sequence compression, we make an improvement where there is no need to perform frequency analysis because the characters included in the sequence are countable small number. For DNA and RNA sequences, the symbols are 4 in number, for which mapping (aligning) has been prepared preliminary for their respective binary sequences.  The message shall be decoded unambiguously. Advantages: When we encode the more common characters with shorter bit sequences and the less common ones with longer bit sequences, we get better results with respect to the length of the message being transmitted. Disadvantages: A major disadvantage of uneven codes is their sensitivity to wrong bits. For them, a wrong bit can cause the entire message to be decoded incorrectly or impossible until the end. Whereas, with uniform codes, one wrong bit causes only one character to be corrupted. 3.1 Building the optimal code An optimized algorithm developed in the study, according to which sequences based on the theory of optimal letter coding will be compressed has following operational features:  Sequence type recognition (RNA or DNA).  Depending on the sequence, predefined binary values for each of the symbols used in the sequence are selected - using the predefined values saves the 227 following procedures, which are necessary to construct the optimal code for each arbitrary message in the Shannon-Fano algorithms and Huffman: a The set of symbols of source A is arranged in order to reduce the probability of message occurrence – pj b The set of probabilities Pi is divided into two groups (p1, p2, …, pj) and (pj+1, pj+2, …, pm), so that the difference shown below is minimal: (1) c The characters whose probabilities are in the first group are assigned with the r-th code letter 0 and those in the second group are assigned with the r-th code letter 1.  For a single-element probability group, the procedure is completed, and for each of the other groups in the multi-step procedure, the elements are numbered from 1 to m and go recursively (r = r + 1) to step 2.  Each of the characters in the sequence is replaced by the corresponding binary entry. Following the changes made to the Shannon-Fano algorithm and the Huffman algorithm, all properties of the letter algorithms coming from the definitions and theorems related to those algorithms retain, ignoring the steps for frequency analysis and computation of the binary comparison tree. Once the sequence is compressed, it can be stored or sent in a much smaller. Reading the compressed sequence is another major problem with the Shannon- Fano and Huffman algorithms. In the Optimized Algorithm suggested in this study, this problem is solved by using a predefined coding tree. In Shannon-Fano and Huffman’s algorithms, in order to decode a properly transmitted message, it is necessary to have the same encoding tree generated when compressing the message. The transfer of message for both algorithms consists of two parts - the encoded message, and the coding tree so that the message can be read. These two parts do not always need to be transferred together. Because algorithms can also be used in the context of information security, where not everyone can decode the message, i.e. not everyone should have a coding tree. In bioinformatics, this is an important feature for working with sequences. In the proposed method, only one has a coding tree. Each compressed / encoded sequence can only be read by the same programming apparatus with which it was compressed / encoded. This is because a unique hashing algorithm 228 is used to generate the coding tree in order to ensure the uniqueness of each coding tree. Each sequence has its own encoding tree and can only be decoded / decompressed by this encoding tree. This approach eliminates the shortcomings of both Shannon-Fano and Huffman’s algorithms for decoding and protecting information. In the proposed Optimized Approach, each of the characters will be encoded with exactly 2 bits. As before the compression, each character occupied 8 bits, here is the first place where we have a 1:4 compression. After applying the algorithms, a compression of up to 1:400 can be achieved. 3.2 Compressing algorithm comparative analysis The three algorithms’ coding/decoding abilities were tested with a 20-letter, 160- bit sample DNA sequence: AACTTTGACGGTATACGCAA. The Shannon-Fano and Huffman algorithms code the sample in three steps: 1) symbol frequency analysis, 2) grouping, 3) assigning each symbol a binary code. Once the binary codes for each of the four symbols are assigned, a coding binary tree is generated, resulting in coding the sequence as 41-bit string: 00110101010111011011111110010011011111000 Decoding requires the generated binary code and the coding tree, which is traced from root to leaves depending on the input symbol – left path for 0 and right – for 1. The Optimized Algorithm omits the frequency calculation step an assigns a two-digit bit code to each of the four symbols: A=00, C=01, G=11, T=10. This coding table is static and does not need to be generated each time. Coding is done by simple substitution, resulting in the following 40-bit string: 0000011010101100011111100010000111010000 Decoding the sequence requires the coded sequence and the static table, thus omitting the need of a coding tree. As shown in Table 2, the Optimised Method produces 5% less bit length than Shanon-Fano and Huffman methods. The result code is also uniform, which may prevent possible bit loss, while in uneven code such errors cannot be detected due to different length of bits in coding sequence. Table 2. Comparative analysis of compression algorithms Coding method Input length Compressed length Shannon-Fano 20 bases, 160 bits 41 bits Huffman 20 bases, 160 bits 41 bits Optimized method 20 bases, 160 bits 40 bits 3.3 Software realization and user interface The application utilizes several platforms combined under Ubuntu Linux OS. There are three main software components: 229  User interface  C++ library  Programming adaptor to connect the library to an intermediate language The user interface is based on Vue.js, Vuetifie, node.js, and npm version manager. Vue.js uses JavaScript templates to generate HTML in real time, which allows the interface to work on different devises and different screen resolutions, because each HTML is generated for the specific devise. Vuetifieе is a programming module that upgrades Vue.js by adding predefined templates. Node.js allows JavaScript to be used as server provisioning method by dynamic building in executable code. It provides the possibility to call each individual HTML element in the user interface separately, and visualize the result when ready – an approach called „reactive programming“. JavaScript was chosen because of its compatibility with RESTful services and JSON data transfer format. Version management of the used Vue.js, Vuetifile and Node.js modules is done by npm, which allows multiple storage for the modules to be registered, and automatically detects and downloads them. 3.4 Choosing programming medium for the compression algorithms C++ was chosen for the algorithms for literal and noise protection coding, for its execution speed and lower memory usage, compared to other languages. It also allows direct memory access, easy and intuitive realisation of bitwise operations and lower-level resource management. These advantages make algorithms written in C++ more optimal than C# and Java. This is one of the main reasons to include connections between platforms in this application, because it utilises both C++ and C#. 4 Results and Discussion Our web-based application for sequence data compression is adapted for various device screen resolutions – from mobile phone (375x812) and tablet (768z1024) to 8k (7680x4320). The user interface is shown on Figure 1 (mobile devise). Filed (B) allows the selection of file(s) to be compressed. The application accepts FASTA and MULTIFASTA file formats, and one or multiple files can be selected, thus allowing parallel compression. The selected files are listed as labels (A) in field (B), and can be deleted by pressing the “clear” symbol (C). 230 Fig. 1. User interface – File input field (B) with one file uploaded (A) and confirmation button (D) Button (D) confirms the query and shows how many files are selected for compression. Then, the files are parsed to extract metadata such as sequence IDs, file size (in bytes), and sequence length (in number of nucleotides). For test compression, four sequences with different length from the Ensembl database were saved as a single, multifasta file: • ENST00000288602.11 - BRAF-201, B-Raf proto-oncogene serin/threonine kinase • ENSE00001025715.1 chromosome:GRCh38:X:38317316:38317465:-1 • ENSE00001025717.1 chromosome:GRCh38:X:38321027:38321089:-1 • ENSG00000157764.13 chromosome: GRCh38:7:140719327:140924928:-1 Each sequence is compressed with the three algorithms, and the time needed for compression is calculated. Figure 2 shows the result for each sequence in the file as a separate, auto-expandable field (Б). Each field shows information about the sequence: Fig. 2. Visualization of compressed sequences • Sequence name (A) – this is obtained from the FASTA format, and if it matches a naming pattern from an entry in one of the public sequence databases, a hyperlink to that entry is created 231 • Current sequence size [in bytes] (B) – this is the sum of the byte size of the ASCII characters corresponding to the letters coding each nucleotide. Then, depending on the sequence length, it is converted to the closest B, KB, MB, GB of TB value by following the International Electrotechnical Commission (IEC) standard where 1 KB = 1024 B. • Drop-down arrow (Г) shows/hides detailed information about the compressed sequence • In the detailed field, (Д) is the sequence name and size label, and by clicking on it the original, uncompressed sequence is shown. (Ж) shows the sequence after compression (E) by each of the three algorithms. Additionally, for each algorithm, a summary of the size [bytes] and time [seconds] of compression is shown. In the current example, there is practically no difference in the compressed sequence size between the three algorithms; considerable difference is seen when compressing sequences of several GB in size. Different access points on the server are used for each of the three sequence compression algorithms. The calling for each algorithm is also parallelized for each sequence, as seen in Figure 3: Fig. 3. Parallel execution of queries for each compression algorithm Another part of the server provides remote access to compression algorithms, developed as libraries. The library is accessed either through the server or by downloading and compiling it, thus making it usable by applications without Internet access. The library can be used as a DLL from several high-level programming languages, such as C# and Java. During development, several compression algorithms were tested, and the performance of each is summarized on Figure 4: 232 Fig. 4. Compression methods comparison: left vertical scale shows size in bits, right scale shows time in milliseconds. Bars indicate the sequence size after compression, and line shows the time. The illustrated above is based on the compression of a 9,219,726 byte sequence, obtained from the public database Genome, part of NCBI. This is the entire genome sequence of the bacteria Bradyrhizobium japonicum, which can be accessed from Genome’s own FTP servers, or through the RefSeq database under the ID NC_017249.1. Six compression methods were applied to the sequence, including the already discussed in this paper Shannon-Fano, Huffman, Optimized Algorithm. They were compared to: 1) Zip algorithm, developed for compressing and archiving one or multiple files; 2) Tar.xz – UNIX and Windows-based compression software that allows no-loss compression; and 3) 7zip – open-source file archiving software. The results clearly show the advantage of the first three algorithms that utilize bitwise coding, which result in a very similar performance, producing smaller compressed file and are generally faster than the others (with the exception of Zip). Huffman and Shannon-Fano give similar results when working with DNA sequences. This is due to the fact that DNA uses only 4-letter alphabet, thus creating similar frequency analyses and overlapping trees with each of those algorithms. The other three methods perform poorly due to them being versatile for various data types and not having been optimized to work with sequences. The main advantage of our Optimized Algorithm is the speed, because it omits the multiple frequency calculations and tree generation of the other two. It also performs better with sequence decompression, not needing to send or store the compressed sequence tree. This results in faster and safer data transfer, because a sequence-coding tree is also sensitive information in terms of software security. All datasets, samples and the software realization itself, are available: https://github.com/BroyanaPulova/DNA_compression_web https://github.com/BroyanaPulova/DNA_compression 233 5 Conclusion The Optimised Algorithm proposed here has been developed for sequence data compression. It performs better in terms of speed, noise-reduction, and compression size than similar methods such as Shannon-Fano and Huffman. A library with these bitwise coding algorithms was created, with compression/ decompression methods for each. The library utilises low-level primitives so that it can be re-used on various platforms. The Optimised sequence compression algorithm was implemented in a number of software products with web-based user interface that allows easy, multi-device and multi-resolution access to the server. All components work in a container-based environment, which allows fast and easy regulation. Future work on this project is going to expand its abilities by adding more compression algorithms, the ability to choose which one(s) to use, and updating the library to work with more programming Acknowledgements This work is supported by the project “GloBIG: A Model of Integration of Cloud Framework for Hybrid Massive Parallelism and its Application for Analysis and Automated Semantic Enhancement of Big Heterogeneous Data Collections”; contract DN02/9/2016. References [1] Kahn, S.D. (2011) On the future of genomic data. Science (80-. ). https://doi.org/10.1126/ science.1197891 [2] Via, M., Gignoux, C. and Burchard, E.G. (2010) The 1000 Genomes Project: New opportunities for research and social challenges. Genome Med. https://doi.org/10.1186/gm124 [3] Hudson, T.J., Anderson, W., Aretz, A., Barker, A.D., Bell, C., Bernabé, R.R. et al. (2010) International network of cancer genome projects. Nature. https://doi.org/10.1038/nature08987 [4] Trelles, O., Prins, P., Snir, M. and Jansen, R.C. (2011) Big data, but are we ready? Nat. Rev. Genet. https://doi.org/10.1038/nrg2857-c1 [5] Schadt, E.E., Turner, S. and Kasarskis, A. (2010) A window into third-generation sequencing. Human Molecular Genetics,. https://doi.org/10.1093/hmg/ddq416 [6] Kuruppu, S., Beresford-Smith, B., Conway, T. and Zobel, J. (2012) Iterative dictionary construction for compression of large DNA data sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics,. https://doi.org/10.1109/TCBB.2011.82 [7] Cao, M.D., Dix, T.I., Allison, L. and Mears, C. (2007) A simple statistical algorithm for biological sequence compression. Data Compression Conference Proceedings,. https://doi. org/10.1109/DCC.2007.7 [8] Antoniou, D., Theodoridis, E. and Tsakalidis, A. (2010) Compressing biological sequences using self adjusting data structures. Proceedings of the IEEE/EMBS Region 8 International Conference on Information Technology Applications in Biomedicine, ITAB,. https://doi. org/10.1109/ITAB.2010.5687689 234 [9] Kaipa, K.K., Bopardikar, A.S., Abhilash, S., Venkataraman, P., Lee, K., Ahn, T. et al. (2010) Algorithm for DNA sequence compression based on prediction of mismatch bases and repeat location. 2010 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2010,. https://doi.org/10.1109/BIBMW.2010.5703941 [10] Pratas, D. and Pinho, A.J. (2011) Compressing the human genome using exclusively Markov models. Advances in Intelligent and Soft Computing,. https://doi.org/10.1007/978-3-642- 19914-1_29 235