=Paper=
{{Paper
|id=Vol-3609/paper28
|storemode=property
|title=Genome Compression Using Program Synthesis
|pdfUrl=https://ceur-ws.org/Vol-3609/paper21.pdf
|volume=Vol-3609
|authors=Denys Maletskyi,Volodymyr Shymanskyi
|dblpUrl=https://dblp.org/rec/conf/iddm/MaletskyiS23
}}
==Genome Compression Using Program Synthesis==
Denys Maletskyi and Volodymyr Shymanskyi
Lviv Polytechnic National University, Lviv, Ukraine

Abstract

With the growing amount of genomic data, finding efficient compression methods is crucial for both storing and analyzing this data. This article discusses the importance of genome data compression, especially in large genomic projects. It introduces program synthesis as a powerful tool for data compression. Program synthesis can create programs that efficiently represent and reproduce data, making it a promising approach for compressing genomes. Readers will learn about program synthesis, how it works, and how it can be used for genome data compression. The article shows how program synthesis can be used to compress genome sequences more effectively. Although the focus is on genomic data, the methods and insights shared in this article can also be applied to compressing other types of data, offering a novel perspective on the current discourse on data compression strategies.

Keywords

Program synthesis, compression, encoding, decoding, genome, DNA

1. Introduction

With the rapid progression of high-throughput sequencing technologies, the amount of genomic data available has surged exponentially, necessitating the development of efficient genome compression methods. As the deluge of genomic data continues, the importance of devising innovative and effective genome compression techniques cannot be overstated.

Recent literature presents various approaches to genome compression. Reference-based compression algorithms, such as CRAM [1] and RENANO [2], present a noteworthy approach to genome data compression. These methodologies achieve substantial compression rates by utilizing a reference genome to represent individual sequences efficiently. While promising in their compression efficacy, these algorithms carry the significant drawbacks of extended compression time and considerable memory requirements, aspects that are detrimental in scenarios demanding swift data processing and limited computational resources. These limitations highlight the necessity for novel strategies that balance compression efficiency, speed, and resource consumption to facilitate more feasible and practical applications in large-scale genome sequencing projects.

Alternative methods, such as CMIC [3], opt for a different balance, slightly relinquishing the compression ratio to gain random access functionality, a feature indispensable for a wide array of tasks. This type of approach is intriguing; it provides a distinct perspective on tackling genome compression challenges, offering a mix of functionality and efficiency. In light of this, our work introduces an alternative approach, discussed in section 4.2.2, that guarantees a fixed compression ratio of 3x. This proposed method not only preserves the vital characteristic of random access but also stands out for its simplicity and unprecedented speed in the compression process, making it an attractive option for tasks that require rapid and efficient handling of genomic data.

Another category of approaches, such as the one elucidated in "Genome Compression Against a Reference" [4], concentrates on batch compression of genomes. These methodologies leverage the inherent similarities among human genomes, as the genetic variation between individuals is relatively minimal.
Hence, these techniques primarily focus on compressing only the sections of the genome data that exhibit differences. By doing so, these approaches efficiently handle the voluminous data associated with multiple genomes, ensuring that the redundancy in the information is effectively minimized, leading to efficient storage and management of genomic datasets.

A distinct method is presented in "Cryptography and Reference Sequence Based DNA/RNA Sequence Compression Algorithms" [5], where the authors apply principles from one domain to another. In this approach, techniques traditionally used in cryptography are employed for the purpose of genome compression. This cross-domain application opens up new possibilities, offering a unique perspective and methodology in addressing the challenges of genome data compression. Through the integration of cryptographic principles, the proposed approach ensures efficient compression. It also introduces mechanisms for securing and managing genomic data, thereby broadening the horizon of genome data compression strategies.

The surveyed approaches underscore the swift advancement unfolding within the genome compression research sector. Observing this rapid progression, it is pivotal to introduce innovative strategies that further the field's capabilities. In response to this imperative, we put forward a novel approach in our study that incorporates a technique hitherto unexplored in the realm of genome compression. This method centers on program synthesis, particularly emphasizing equality saturation for compression tasks. The employment of program synthesis, and more specifically the equality saturation approach, is posited to offer commendable performance in genome compression, opening new avenues for efficient and effective genomic data handling and analysis.

2. Background

Data compression is pivotal for minimizing data size, facilitating efficient storage, and expediting transmission. It is essential for reducing costs and quickening the pace of data analysis and sharing. Data compression methodologies fall into two categories: lossless and lossy. Lossless compression ensures that the original data's integrity is maintained upon decompression, whereas lossy compression results in some data loss but often attains higher compression rates.

Shannon's entropy [6] is integral to data compression, providing a theoretical limit to the best possible lossless compression ratio achievable by any compression algorithm. Grasping the concept of Shannon's entropy is vital as it provides insights into data's compressibility, steering the creation of more efficient algorithms. Shannon's entropy is calculated using the formula:

H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)    (1)
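As a concrete illustration of equation (1), the following minimal Python sketch (our illustrative addition, not code from the compression pipeline described later) estimates the empirical Shannon entropy of a nucleotide string from its symbol frequencies; the example string is an arbitrary toy input.

```python
from collections import Counter
from math import log2

def shannon_entropy(sequence: str) -> float:
    """Empirical Shannon entropy in bits per symbol, following equation (1)."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Toy usage on a short string; a real genome would be streamed from FASTA files.
print(shannon_entropy("ACGTACGTNNAACCGGTT"))  # ~2.28 bits per symbol for this toy input
```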
Traditional compression algorithms, including Huffman encoding [7], Arithmetic encoding [8], Run-Length Encoding (RLE) [9], and dictionary-based methods such as the LZW algorithm [10] or GZIP [11], serve as foundational knowledge for this discussion. Huffman encoding generates variable-length codes for each symbol based on their frequencies, while Arithmetic encoding represents a sequence of symbols as a single number. Both RLE and GZIP exploit redundancy in data for compression.

2.1. Genome compression challenges

Genome data compression is daunting due to the genomic dataset's unique and voluminous nature. Here are some of these challenges:
● Volume: A single human genome encompasses billions of base pairs, resulting in substantial data volumes and necessitating effective compression algorithms.
● Diversity: The inherent diversity in genomic data means that variations between genomes are vast, necessitating algorithms that can efficiently handle a broad data spectrum without losing crucial information.
● Noise: Sequencing errors and N's (unknown nucleotides) in the data introduce noise, complicating the compression process.
● Accessibility: The compressed data should be easily retrievable for further analysis and research without imposing significant computational overheads.
Addressing these challenges is crucial for devising effective genome data compression algorithms.

2.2. Program synthesis overview

Program synthesis is the process of automatically generating programs based on a given specification. Instead of manually crafting algorithms, developers can provide requirements, and a synthesizer outputs a program matching these criteria [7]. Program synthesis, in essence, facilitates automating parts of the programming process. There are multiple approaches to program synthesis:
1. Deductive Synthesis: starts with a high-level specification and refines it step-by-step into a full-fledged program [12].
2. Stochastic Synthesis: uses probabilistic methods to explore the program space, often making use of genetic algorithms [13].
3. Syntax-Guided Synthesis (SyGuS): defines the semantic constraint the desired function must satisfy and a syntactic constraint restricting the space of allowed implementations [14].

A fundamental concept tied to program synthesis is Kolmogorov complexity [15], which quantifies the computational resources needed to specify a string or piece of data. For a string s, its Kolmogorov complexity K(s) is the length of the shortest possible program (in some fixed programming language) that, when run, produces s as output [16]:

K(s) = \min_{p : U(p) = s} |p|    (2)

where U is a universal Turing machine and p is a program such that U(p) = s.

Equality saturation is an approach wherein a given program's many equivalent forms are explored simultaneously. It employs e-graphs to represent a large set of programs compactly. E-graphs allow for finding rewrites in parallel, enabling the synthesizer to search for the most optimal (shortest or fastest) program that satisfies the same specification as the original [17].
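To make the link between Kolmogorov complexity, program synthesis, and compression concrete, the toy Python sketch below (our illustration; a brute-force stand-in, not the e-graph-based equality saturation used later in the paper) searches a deliberately tiny expression space for the shortest expression that evaluates back to a target string.

```python
from itertools import product

def toy_programs(target: str):
    """A deliberately tiny expression space: 'X' * n  and  'X' * a + 'Y' * b,
    where X, Y are symbols of the target and the repetition counts sum to len(target)."""
    n = len(target)
    alphabet = sorted(set(target))
    for ch in alphabet:
        yield f"'{ch}' * {n}"
    for x, y in product(alphabet, repeat=2):
        for a in range(1, n):
            yield f"'{x}' * {a} + '{y}' * {n - a}"

def shortest_program(target: str) -> str:
    """Shortest expression found in the toy space that regenerates the target,
    echoing K(s): the length of the shortest program producing s."""
    best = repr(target)  # worst case: store the string literal itself
    for expr in toy_programs(target):
        if len(expr) < len(best) and eval(expr) == target:  # eval is acceptable for this toy grammar
            best = expr
    return best

print(shortest_program("AAAAAAAACCCCCCCC"))  # -> 'A' * 8 + 'C' * 8
```

Real synthesizers avoid such exhaustive enumeration; equality saturation instead represents many equivalent rewrites of a seed program compactly in an e-graph.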
3. Methodology

The Methodology section outlines the research design, detailing the specific methods and procedures for data collection, analysis, and experimentation. This section serves as a blueprint for the entire research process, providing readers with a comprehensive understanding of the approach to addressing the research questions.

3.1. Analysis of Entropy in Human Genome Data

The data under consideration in this research is the human genome, primarily composed of four nucleotides: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). However, it is crucial to note that biological data acquisition, particularly the digitalization of DNA genome sequences, is not entirely error-free. As a result, certain parts of the genome are marked as 'N' to indicate uncertainties or errors in sequencing. These 'N' markers introduce additional complexity when analyzing the genomic data.

The entropy of the human genome is evaluated through the following steps:
1. Average Frequency Calculation: for each nucleotide s ∈ {A, C, G, T, N}, the average frequency f_s is calculated based on its occurrences in the entire dataset.
2. Average Probability Calculation: the average probability P(s) of each nucleotide is then determined by normalizing its frequency f_s by the total number of nucleotides in the dataset.
3. Entropy Calculation: using Shannon's entropy formula (1), the entropy H is then calculated as

H = -\sum_{s \in \{A,C,G,T,N\}} P(s) \log_2 P(s)    (3)

Based on the analysis conducted, the average probabilities for each of the nucleotides in the dataset are presented in Table 1.

Table 1. Calculated probabilities of each symbol in the human genome, based on their frequencies in the dataset.

Nucleotide   A       C       G       T       N
P(X)         0.279   0.194   0.195   0.280   0.048

Using these probabilities, Shannon's entropy H can be calculated as:

H = -(P(A)\log_2 P(A) + P(C)\log_2 P(C) + P(G)\log_2 P(G) + P(T)\log_2 P(T) + P(N)\log_2 P(N))    (4)

Substituting the given probabilities:

H = -(0.279\log_2 0.279 + 0.194\log_2 0.194 + 0.195\log_2 0.195 + 0.280\log_2 0.280 + 0.048\log_2 0.048)    (5)

H ≈ 2.162 bits    (6)

Therefore, the calculated entropy of the dataset based on the given probabilities is approximately 2.162 bits. This value serves as a theoretical lower limit for the average number of bits required to represent each nucleotide in the human genome for the given dataset. It provides a benchmark for evaluating the effectiveness of different compression algorithms applied to this specific genomic data. By understanding the entropy inherent in the human genome, particularly with the considerations of data errors denoted by 'N', we can better gauge the challenges and limits of compressing this form of biological data. It sets the stage for exploring advanced compression algorithms tailored for genomic data.

3.2. Reducing Entropy Through Data Transformation

3.2.1. Entropy Calculation for Combination-based String

One of the pivotal ideas in data compression is that entropy can change based on how data is represented. For instance, consider a simple transformation where consecutive identical symbols are replaced with an integer representing the length of the run; compact representations of this kind also matter on systems with low technical characteristics [18] and in optimization algorithms that use parallel computing [19]. The entropy H' of this new data would typically be greater than the entropy H of the original data due to the increased number of unique symbols (combinations) in the transformed string. However, because these new unique symbols represent 2-symbol combinations from the original string, the effective number of bits per original symbol would be H'/2, which could be more efficient for compression.
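The following small sketch (our illustrative addition; the blocked string and helper names are assumptions) shows how regrouping a string into 2-symbol blocks changes the measured entropy, comparing H with the effective H'/2 bits per original symbol discussed above.

```python
from collections import Counter
from math import log2

def entropy(symbols) -> float:
    """Shannon entropy in bits per symbol of an iterable of symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

seq = "AACCAACCGGTTGGTT"
blocks = [seq[i:i + 2] for i in range(0, len(seq), 2)]  # 2-symbol combinations

h = entropy(seq)           # H: bits per original symbol
h_prime = entropy(blocks)  # H': bits per 2-symbol block
print(h, h_prime / 2)      # 2.0 vs 1.0 bits per original symbol for this structured toy string
```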
3.2.2. Focusing on Programmatic Representations

This article aims to shift the spotlight from conventional transformations to an alternate paradigm: representing data as a program. Here, the concept of Kolmogorov complexity comes into play, which posits that the complexity of a data string can be defined as the length of the shortest possible program that can generate that string. Thus, if we can find a shorter programmatic representation, we can effectively lower the complexity and, thereby, the entropy of the original data.

To illustrate this, let us consider the string "AAAAAAAACCCCCCCC". The entropy of this data string, considering its highly repetitive nature, would be:

H = -(P(A)\log_2 P(A) + P(C)\log_2 P(C)) = 1 bit    (7)

Compression algorithms that focus solely on entropy, like Huffman encoding, would be highly effective here, compressing the data to just 16 bits: "0000000011111111". Now, consider a Python representation of the same string:

'A' + 'A' + 'A' + 'A' + 'A' + 'A' + 'A' + 'A' + 'C' + 'C' + 'C' + 'C' + 'C' + 'C' + 'C' + 'C'    (8)

By applying mathematical axioms, we can rewrite this program into a shorter one:

'A' * 8 + 'C' * 8    (9)

Going further, this can be rewritten as:

('A' + 'C') * 8    (10)

This can then be further reduced to its Reverse Polish Notation (RPN):

AC+8*    (11)

When we apply Huffman encoding to this RPN representation, we get a string with an average code length L of 12/16 = 0.75 bits per original symbol, which is indeed less than the entropy H = 1 bit of the original string. This example vividly demonstrates that by shifting our focus to different data representations, in this case programmatic representations, we can achieve even more efficient compression than what would be suggested by the original data's entropy. This opens up a new frontier for data compression techniques, especially in specialized fields like genomic data compression.

3.3. Program Synthesis for Data Transformation

3.3.1. Introduction to the Problem Space

We have established that altering data representations can reduce entropy, enhancing the compression ratio. Program synthesis, particularly the method of equality saturation, offers a powerful means to discover optimal programs capable of regenerating the original data. Tools like EGG (E-Graphs Good) [20] provide robust data structures designed for efficient exploration of possible rewrites for such programs. However, the naive approach of converting an entire genomic sequence into a complex mathematical construct, as illustrated in the previous section, is computationally untenable. This method would require enormous processing power and time, rendering it practically useless for large-scale genomic data.

3.3.2. The Challenges of Sequence Length

One could consider a more feasible approach, where all possible combinations of a fixed length n of nucleotides are identified, and an optimal program is sought for each combination. Yet this approach presents its own set of challenges, the most formidable of which is computational complexity.

Figure 1: Visualization of the exponential growth of the number of possible nucleotide combinations of length n.

The total number of such combinations is 5^n, where n is the length of the nucleotide sequence under consideration (Fig. 1). Because the number of combinations increases exponentially with n, we would quickly hit a computational ceiling. This is a significant issue because longer sequences generally offer greater opportunities for effective compression, making it imperative to find a strategy that allows us to consider long sequences without prohibitive computational costs.

3.3.3. Toward a More Scalable Approach

Given these challenges, it becomes evident that a different strategy must be devised: one that enables us to explore longer nucleotide sequences without overwhelming computational resources while also effectively lowering the entropy of the genomic data to achieve a high compression ratio.
To develop a more efficient method, we can capitalize on the similarities between different nucleotide combinations in the context of program synthesis. Consider sequences like "AAAA" and "CCCC"; their programmatic representations would be quite similar, e.g., 'A' * 4 and 'C' * 4. These similarities hint that we might not need to consider every possible combination in the search space, offering a potential route to computational savings.

3.3.4. Run-Length Encoding as a Metric

One effective metric for gauging the similarity between sequences is Run-Length Encoding (RLE) [9]. When we apply RLE to a sequence, we reduce it to a series of counts representing the lengths of runs of each distinct element. For example, the sequence "AAAA" would be transformed into the count (4), and "CCCC" would also yield (4). By analyzing sequences using RLE, we begin to notice that many combinations share identical or similar counts. These counts effectively serve as "fingerprints" that allow us to cluster similar sequences together, significantly reducing the search space for program synthesis. By employing a metric like RLE to identify and group similar sequences, we can substantially reduce the number of unique combinations that must be analyzed. This provides a two-fold benefit: first, it allows the computational resources to focus only on distinct "classes" of sequences, making the process more tractable. Second, it still allows us to consider longer sequences without incurring exponential computational costs, thereby offering the potential for more effective compression.

3.3.5. Tailoring the Search Space with Subset Sum

To achieve even greater computational efficiency, we can move beyond merely reducing the search space of similar sequences and focus directly on the sequences most pertinent to our objective. A compelling avenue for this targeted approach is to examine the unique Run-Length Encoding (RLE) counts that could occur for any combination of nucleotides of size n. To generate these unique RLE counts, we can adapt the well-known Subset Sum problem, a combinatorial problem that aims to find a subset whose sum matches a given target number. In our case, the target number is n, the length of the nucleotide sequence. The standard Subset Sum problem seeks a subset of numbers that sum up to a given target. For our application, however, we need to introduce a slight modification: in addition to finding the subsets that add up to n, we also need to consider the permutations of each subset. This is critical because different permutations of the same subset correspond to different length-equivalent nucleotide sequences. For instance, the RLE counts (2, 1, 1) and (1, 2, 1) correspond to different sequences, such as "AACG" and "CAAG". This modified Subset Sum approach allows us to efficiently generate all unique RLE counts that could exist for sequences of length n. By synthesizing programs only for these unique counts, we can dramatically cut down on the computational resources needed, as we are now focusing solely on the 'representative' sequences that capture all possible RLE patterns for the given length.
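As a sketch of this targeted enumeration (our illustration; the function names are assumptions rather than the paper's implementation), the ordered positive-integer tuples that sum to n, i.e., the compositions of n, are exactly the unique RLE count patterns a length-n sequence can produce:

```python
def rle_counts(seq: str):
    """RLE fingerprint of a sequence: run lengths only, e.g. 'AACG' -> (2, 1, 1)."""
    if not seq:
        return ()
    counts, run = [], 1
    for prev, cur in zip(seq, seq[1:]):
        if cur == prev:
            run += 1
        else:
            counts.append(run)
            run = 1
    counts.append(run)
    return tuple(counts)

def rle_count_patterns(n: int):
    """All ordered tuples of positive integers summing to n: the 'modified Subset Sum'
    enumeration of every RLE count pattern a length-n sequence can have."""
    if n == 0:
        yield ()
        return
    for first in range(1, n + 1):
        for rest in rle_count_patterns(n - first):
            yield (first,) + rest

print(rle_counts("AACG"), rle_counts("CAAG"))   # (2, 1, 1) (1, 2, 1)
print(len(list(rle_count_patterns(8))))         # 128 patterns, versus 5**8 = 390625 raw sequences
```

For length n there are 2^(n-1) such patterns, which still grows with n but far more slowly than the 5^n raw combinations; this is what makes synthesis over representative fingerprints tractable.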
The benefit is twofold: first, it narrows down the search space to only those sequences that are absolutely necessary for program synthesis, making the approach far more computationally efficient. Second, it does so without compromising the integrity of the analysis, as these unique RLE counts still provide a comprehensive overview of all the possible sequence patterns for a given length. In conclusion, this targeted computation method, which uses a modified Subset Sum problem to focus on unique RLE counts, offers a highly efficient and effective way to apply program synthesis to genomic data compression. By selectively focusing on relevant sequences, we can achieve high compression rates while significantly reducing computational costs.

3.4. Implementing Program Synthesis for Compression

In the context of our genome compression task, we concentrate primarily on sequences represented through Run-Length Encoding (RLE) [9]. RLE produces tuples of values and counts, where values denote nucleotides and counts represent their consecutive repetitions. With our focus narrowed down to these RLE sequences, the complexity of the task is significantly reduced.

3.4.1. Value Encoding

Given that RLE sequences inherently lack consecutive identical symbols, we allocate a mere two bits to encode each nucleotide value, effectively representing {'A', 'C', 'G', 'T', 'N'} \ {previous symbol}. This minimalist encoding strategy offers simplicity and efficiency, aiding the overall compression process without adding unnecessary computational overhead.

3.4.2. Count Encoding through Program Synthesis

The count component in the RLE sequences presents a more dynamic and diverse range of values, necessitating a thoughtful approach for efficient encoding. In response to this challenge, our methodology employs program synthesis, specifically the technique of equality saturation utilizing E-Graphs. Equality saturation through E-Graphs [13] provides a robust platform for exploring various equivalent forms of a given count, facilitating the identification of the most efficient representation. The process involves the generation of multiple program alternatives, with each variant being a potential candidate for the optimal program. Through systematic exploration and comparison, the most concise and efficient program is selected for representing a given count.

3.4.3. Program Encoding

Upon synthesizing optimal programs, the next step involves their encoding. Here, Huffman encoding serves as a primary tool renowned for its effectiveness in minimizing the length of encoded data. However, in cases where further compression is desirable, we resort to a custom packaging algorithm. This tailor-made algorithm provides an additional layer of compression, subtly improving upon the results achieved through Huffman encoding alone. Through this multi-tiered approach, combining value encoding, program synthesis for count representation, and subsequent encoding of the synthesized programs, we construct a comprehensive genome compression framework. This framework offers promising compression ratios and ensures efficiency and speed in both the compression and decompression stages.
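A minimal sketch of the value/count split and the two-bit value encoding from section 3.4.1 is given below (our illustrative reconstruction; the authors' actual encoder, the synthesized count programs, and the final bit packing are not shown). Because an RLE value stream never repeats a symbol consecutively, each value after the first can be coded by its index among the four candidates that remain once the previous value is excluded.

```python
ALPHABET = ['A', 'C', 'G', 'T', 'N']

def rle(seq: str):
    """Split a sequence into (value, count) pairs, e.g. 'AAAACCN' -> [('A', 4), ('C', 2), ('N', 1)]."""
    pairs = []
    for ch in seq:
        if pairs and pairs[-1][0] == ch:
            pairs[-1] = (ch, pairs[-1][1] + 1)
        else:
            pairs.append((ch, 1))
    return pairs

def encode_values(values):
    """Two bits per value after the first: the index of the value among the four
    candidates left after removing the previous value (the first value uses three bits)."""
    bits, prev = [], None
    for v in values:
        if prev is None:
            bits.append(format(ALPHABET.index(v), "03b"))
        else:
            candidates = [s for s in ALPHABET if s != prev]  # exactly four candidates -> 2 bits
            bits.append(format(candidates.index(v), "02b"))
        prev = v
    return "".join(bits)

pairs = rle("AAAACCGGGGTNNA")
values = [v for v, _ in pairs]
counts = [c for _, c in pairs]   # the counts are what the program-synthesis stage encodes
print(pairs)
print(encode_values(values))     # '000' for the first value, then two bits per value
```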
4. Results and Discussion

4.1. Data Set Selection

For the purpose of this research, we utilize publicly accessible human genome sequences. These sequences have been obtained from two primary repositories: the National Center for Biotechnology Information (NCBI) [21] and the Genome Aggregation Database (gnomAD) [22]. The selection of data for this study has been influenced by specific criteria, which we outline below:
1. Completeness. It is imperative that the genome data used for this research be as complete as possible. Completeness in this context means that the sequences are well-annotated, with all regions clearly marked and identified. Additionally, the data should be free from missing values. Missing values could potentially skew the outcomes and introduce unnecessary noise into the program synthesis and compression process. Therefore, only datasets that fulfill these stringent requirements of completeness are included in the study.
2. Diversity. Given that the human genome exhibits natural variations across different ethnic and geographical backgrounds, including as much diversity as possible in our dataset is essential. By incorporating genomic sequences from various ethnic backgrounds, the study aims to account for this natural variability. This ensures that the developed compression techniques are generalizable and effective across a broad spectrum of genomic data.

In adhering to the completeness and diversity criteria, we have compiled a dataset consisting of 1000 diverse human genomes. Each genome in this dataset is meticulously selected to represent a variety of ethnic and geographical backgrounds, ensuring a comprehensive and inclusive data pool. The dataset exhibits a balanced distribution of nucleotides. On average, each genome is approximately 3.07 GB in size. Given this, the cumulative size of our dataset is approximately 3.1 TB, providing a substantial data pool for robust testing and validation of the proposed compression techniques.

Figure 2: An analysis of the nucleotide distribution within the dataset.

The probability of having 'N' is relatively low, with a value of 0.048. Given this low probability, the focus should be placed on efficiently compressing the primary nucleotides (A, C, G, T), which have higher occurrence rates. Furthermore, an analysis of the nucleotide distribution (Fig. 2) within the dataset reveals an average composition reflective of general human genome characteristics, providing a solid foundation for developing and testing our program synthesis-based compression techniques.

4.2. Benchmark Algorithms

In the initial phase of this study, it is crucial to establish a strong baseline against which the efficacy of the proposed program synthesis-based compression techniques can be measured. To that end, we have implemented and tested several classical compression algorithms to serve as benchmarks. These include the Huffman and Arithmetic encoders and Run-Length Encoding (RLE).

4.2.1. Classical Algorithms

The classical algorithms, such as the Huffman and Arithmetic encoders, provide a foundational understanding of the limits and potentials of general-purpose compression techniques. The RLE algorithm was also included to cover a broader spectrum of existing compression methods.

4.2.2. Custom Packing Algorithm

In addition to these well-established algorithms, we have also developed a custom packing approach specifically tailored for genomic data. The crux of the custom packing algorithm is the ability to compress a combination of three nucleotides, along with the error marker 'N', into a single byte, thereby achieving a 3-fold compression ratio and an average code length of 2.667 bits. This approach is noteworthy for achieving a 3-fold compression ratio while maintaining the critical feature of O(1) random access. This implies that any segment of the compressed genomic data can be accessed in constant time, which is a significant advantage for many genomic applications. The custom packing approach also presents enormous potential for parallelization.
Initial tests have shown compelling results in terms of time efficiency for both encoding and decoding operations. This makes it a strong candidate for comparison with the program synthesis-based techniques developed in this work. To achieve this, the byte is split into distinct sections of bits, where the first two bits act as a counter for the occurrences of the 'N' symbol within a nucleotide triplet. Depending on the counter's value, the remaining six bits are parsed differently:
● For N = 0: XX, YY, and ZZ each represent one of the nucleotides in the set {A, C, G, T}, using two bits per nucleotide.
● For N = 1: II specifies the position of 'N' in the triplet, and XX and YY encode the other two nucleotides, with two bits for each.
● For N = 2: II represents the position of the single non-'N' nucleotide, which XX encodes; the last two bits are not used.
● For N = 3: all symbols in the triplet are 'N', so the six bits following the counter are disregarded.

This structured bit-partitioning efficiently represents nucleotide triplets while accounting for the prevalence of 'N' symbols, streamlining the encoding and decoding processes. Through this bit-partitioning strategy, the custom packing algorithm is not only efficient in terms of compression ratio but also maintains O(1) random access, which is pivotal for many genomic data applications. It is worth mentioning that this algorithm does not require any additional lookup table or tree for data decoding, further enhancing its performance and making it highly suited for real-time applications.
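The Python sketch below shows one possible byte layout consistent with the description above (an illustrative reconstruction; the exact field order and padding used by the authors are assumptions). Each triplet maps to exactly one byte, so any position can be decoded independently, which is what gives the O(1) random access.

```python
NUCS = ['A', 'C', 'G', 'T']  # two bits each; 'N' is handled through the counter field

def pack_triplet(triplet: str) -> int:
    """Pack three symbols from {A, C, G, T, N} into one byte:
    bits 7-6 hold the count of 'N' in the triplet, bits 5-0 hold the payload."""
    n_count = triplet.count('N')
    payload = 0
    if n_count == 0:                       # three 2-bit nucleotide codes
        for ch in triplet:
            payload = (payload << 2) | NUCS.index(ch)
    elif n_count == 1:                     # position of 'N' plus the two remaining nucleotides
        pos = triplet.index('N')
        others = [ch for ch in triplet if ch != 'N']
        payload = (pos << 4) | (NUCS.index(others[0]) << 2) | NUCS.index(others[1])
    elif n_count == 2:                     # position of the single non-'N' nucleotide plus its code
        pos = next(i for i, ch in enumerate(triplet) if ch != 'N')
        payload = (pos << 4) | (NUCS.index(triplet[pos]) << 2)
    return (n_count << 6) | payload        # for n_count == 3 the six payload bits stay unused

def unpack_triplet(byte: int) -> str:
    """Invert pack_triplet."""
    n_count, payload = byte >> 6, byte & 0x3F
    if n_count == 0:
        return "".join(NUCS[(payload >> shift) & 0b11] for shift in (4, 2, 0))
    if n_count == 1:
        pos, a, b = payload >> 4, (payload >> 2) & 0b11, payload & 0b11
        out = [NUCS[a], NUCS[b]]
        out.insert(pos, 'N')
        return "".join(out)
    if n_count == 2:
        out = ['N', 'N', 'N']
        out[payload >> 4] = NUCS[(payload >> 2) & 0b11]
        return "".join(out)
    return "NNN"

for t in ("ACG", "ANT", "NNC", "NNN"):
    assert unpack_triplet(pack_triplet(t)) == t  # round-trip check: one byte per three symbols (3x)
```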
4.3. Metrics for Evaluation

Evaluating the efficiency of data compression algorithms necessitates the use of precise and reliable metrics. This research utilizes two primary metrics: Average Code Length and Compression Ratio. These metrics are crucial for quantifying the performance of the implemented compression algorithms and providing an objective comparison among them.

4.3.1. Average Code Length (L)

The Average Code Length is a vital metric that reflects the average length of the codes generated by a compression algorithm. For a set of symbols S with respective probabilities P, the Average Code Length (L) is calculated as:

L = \sum_{i=1}^{n} p_i \cdot l_i    (12)

where p_i is the probability of the i-th symbol in the dataset, l_i is the length of the code assigned to the i-th symbol, and n is the total number of unique symbols in S. The average code length provides insight into the efficiency of a compression algorithm. Ideally, the average code length should be as close as possible to the entropy of the source to achieve optimal compression.

4.3.2. Compression Ratio (R)

The Compression Ratio is another crucial metric that measures a compression algorithm's reduction in data size. It is defined as the ratio of the uncompressed data size (U) to the compressed data size (C):

R = U / C    (13)

where U is the size of the uncompressed data and C is the size of the compressed data. A higher compression ratio indicates more efficient data reduction. However, it is essential to consider that a higher compression ratio does not always imply better compression, as the quality and readability of the compressed data must also be maintained.

4.4. Results Comparison

The algorithms were tested on a custom dataset of human genomes, meticulously curated and discussed in previous sections, to ensure a balanced representation of various genomic sequences. Each genome in this dataset represents a unique and diverse genetic structure, offering a challenging and comprehensive ground for testing the efficiency of the algorithms. For each genome in the dataset, we applied the various compression algorithms: Entropy, Huffman, RLE (Run-Length Encoding), Custom Packing, and Program Synthesis combined with Huffman encoding (PS + Huffman). The Average Code Length (L) and Compression Ratio (R) were computed for each algorithm on every genome sequence, and the results were then averaged over the entire dataset to obtain a more generalized and robust understanding of each algorithm's performance.

Table 2. Comparison of the benchmarks and the proposed algorithm.

Approach         Average Code Length (L)   Compression Ratio (R)
Entropy          2.039                     3.923
Huffman          2.241                     3.569
RLE              5.633                     1.428
Custom Packing   2.667                     3.000
PS + Huffman     2.154                     3.714

The proposed approach (PS + Huffman) emerges as the most efficient method among those analyzed, closely approximating the theoretical entropy and achieving the highest compression ratio. Its success can be attributed to the effective combination of program synthesis for count representation and Huffman encoding, which collectively optimize the compression process. As Table 2 demonstrates, while the Huffman algorithm and the custom packing method exhibit commendable performance, they do not reach the efficiency level of the proposed approach. Although RLE is a straightforward method, it lags significantly in terms of compression ratio and code length, making it unsuitable for optimal genome data compression in this context.

5. Conclusions

This article has presented promising results for genome data compression through the innovative use of program synthesis, specifically employing the equality saturation technique. Our approach offers substantial improvements in compression ratio and ensures a reduction in average compression time, proving its effectiveness and efficiency in handling genome data. The method introduced applies program synthesis meticulously to extract and efficiently encode optimal programs for different sections of genome data. This careful application bridges the gap between theoretical potential and practical utility, resulting in a robust and efficient compression technique. Future work can enhance this approach by incorporating deeper insights into the intrinsic characteristics of genome data. The encoding phase outlined in this study also offers scope for refinement. By adopting advanced encoding strategies and integrating domain-specific knowledge, we anticipate the development of a more powerful and refined technique for genome data compression. In conclusion, this paper marks a significant step forward in the dynamic field of genome data compression, opening up new possibilities for further research and innovation in developing more efficient compression techniques.

6. Acknowledgments

I would like to express my deep gratitude to Lviv Polytechnic National University, Lviv, Ukraine, for their unwavering support and provision of essential resources necessary for conducting this research. The university has provided a conducive and stimulating environment, which has been pivotal in the successful completion of this study. In addition, a sincere acknowledgment is due to the Ukrainian Armed Forces.
The opportunity to undertake and complete this research amidst these challenging times is largely attributed to their resilient spirit and unfaltering commitment to peace and stability in Ukraine.

7. References

[1] A. Al Ali, et al., "CRAM compression: practical across-technologies considerations for large-scale sequencing projects," bioRxiv, pp. 2022–12, 2022.
[2] G. Dufort Álvarez, et al., "RENANO: a REference-based compressor for NANOpore FASTQ files," Bioinformatics, vol. 37, no. 24, pp. 4862–4864, 2021.
[3] H. Chen, et al., "CMIC: an efficient quality score compressor with random access functionality," BMC Bioinformatics, vol. 23, no. 1, p. 294, Dec. 2022. doi: 10.1186/s12859-022-04837-1.
[4] A. Laud, et al., "Genome Compression Against a Reference," arXiv preprint arXiv:2010.02286, 2020.
[5] S. P. Daggubati, et al., "Cryptography and Reference Sequence Based DNA/RNA Sequence Compression Algorithms," vol. 27, no. 3, pp. 509–514, 2022. http://iieta.org/journals/isi
[6] P. Chanda, et al., "Information Theory in Computational Biology: Where We Stand Today," Entropy, vol. 22, 627, 2020. https://doi.org/10.3390/e22060627
[7] S. Alyami, C.-H. Huang, "Nongreedy Unbalanced Huffman Tree Compressor for Single and Multifasta Files," Journal of Computational Biology, vol. 27, no. 6, pp. 868–876, 2020.
[8] J. Voges, et al., "GABAC: an arithmetic coding solution for genomic data," Bioinformatics, vol. 36, no. 7, pp. 2275–2277, April 2020. https://doi.org/10.1093/bioinformatics/btz922
[9] S. Fiergolla, P. Wolf, "Improving Run Length Encoding by Preprocessing," 2021 Data Compression Conference (DCC), Snowbird, UT, USA, 2021, pp. 341–341. doi: 10.1109/DCC50243.2021.00051.
[10] Y.-L. Tsai, J.-J. Ding, "An Improved LZW Algorithm for Large Data Size and Low Bitwidth per Code," TENCON 2021 - 2021 IEEE Region 10 Conference (TENCON), Auckland, New Zealand, 2021, pp. 203–208. doi: 10.1109/TENCON54134.2021.9707201.
[11] M. Kerbiriou, R. Chikhi, "Parallel decompression of gzip-compressed files and random access to DNA sequences," 2019.
[12] S. Gulwani, "Dimensions in Program Synthesis," Proceedings of the 12th International ACM SIGPLAN Symposium on Principles and Practice of Declarative Programming, 2010.
[13] A. Mishra, S. Jagannathan, "Specification-Guided Component-Based Synthesis from Effectful Libraries," 2022. doi: 10.48550/arXiv.2209.02752.
[14] E. Pantridge, T. Helmuth, "Solving Novel Program Synthesis Problems with Genetic Programming using Parametric Polymorphism," Proceedings of the Genetic and Evolutionary Computation Conference, ACM, 2023.
[15] J. Parsert, E. Polgreen, "Reinforcement Learning for Syntax-Guided Synthesis," 2023.
[16] H. Goldberg, V. Kabanets, "Improved Learning from Kolmogorov Complexity," Proceedings of the 38th Computational Complexity Conference, Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2023.
[17] K. Alaskandarani, "Low Complexity, Low Probability Patterns and Consequences for Algorithmic Probability Applications," Complexity, 2023, 9696075.
[18] V. Semkovych, V. Shymanskyi, "Combining OCR Methods to Improve Handwritten Text Recognition with Low System Technical Requirements," Lecture Notes on Data Engineering and Communications Technologies, 2023, pp. 693–702. https://doi.org/10.1007/978-3-031-24475-9_56
[19] L. Mochurad, "Optimization of Regression Analysis by Conducting Parallel Calculations," COLINS-2021: 5th International Conference on Computational Linguistics and Intelligent Systems, April 22–23, 2021, Kharkiv, Ukraine, pp. 982–996.
[20] Y. Zhang, et al., "Better Together: Unifying Datalog and Equality Saturation," 2023.
[21] National Center for Biotechnology Information (NCBI) [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; [2023]. Available from: https://www.ncbi.nlm.nih.gov/
[22] Genome Aggregation Database (gnomAD) [Internet]. (RRID: SCR_014964). Available from: http://gnomad.broadinstitute.org/