Proceedings of the 11th International Conference on Applied Informatics Eger, Hungary, January 29–31, 2020, published at http://ceur-ws.org Planning Safety Solutions, Models and Algorithms for Special Databases∗ Fanni Magdolna Hízóa , Attila Kissb a ELTE Eötvös Loránd University, Faculty of Informatics, Budapest, Hungary fanni.hizo@gmail.com b ELTE Eötvös Loránd University, Faculty of Informatics, Budapest, Hungary kiss@inf.elte.hu Abstract DNA sequencing is the process of determining the order of nucleotides in DNA. The rapid speed of sequencing attained with modern DNA sequenc- ing technology has been instrumental in the sequencing of complete DNA sequences, including the human genome. Nevertheless it is a sensitive data which needs safe but efficient storage methods. The goal in this research was to analyze different models and algorithms to determine which is the most applicable for storage, and query considering the need of user permissions, and encryption. Keywords: bioinformatics, data science, data protection, data compression, database applications MSC: 68P15, 92B05, 97R50, 68P20, 97R30 1. Motivation Due to cheaper methods it is possible for everyone to keep their own DNA on their own device. However these are sensitive data which needs to be secured. Genomes are stored mostly in text files. The size of these files could be even 3GB or more depending on the analyzed species. Secured data management is not solved in these files. By using database managers, the levels of permissions can be easily managed, because many database manager systems have built-in security and encryption possibility. ∗ The project was supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.3-VEKOP-16-2017-00002). Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 160 2. Related Works Previous researches are also available about using database manager systems for storing genotypic data, describing and comparing multiple techniques [1]. One uses the database as a container to store bioinformatic file formats [2]. The authors of this paper explores the array-based technique, showing that it scales to far greater volumes than other existing techniques [1]. In this paper the authors construct a DNA relational database with a simple data model in which one DNA molecule stores one piece of data, [3] and this one introduces the database aspects of DNA computing [4]. This paper discuss the Cassandra NoSQL database approach for storing genomic data [5]. In [6] the authors managed to implement a web interface, to help students to develop informatic thinking skills. When considering possible encryption methods for data security, one of them is the 64-bit block ciphers (e.g Blowfish, 3DES), but these ciphers are known to be susceptible to collision attacks [7]. On the other hand the AES-128 uses 128-bit blocks and gives better protection [8]. 3. DNA processing 3.1. DNA Deoxyribonucleic acid (DNA) is a molecule composed of two chains that coil around each other to form a double helix carrying genetic instructions. Nucleic acids are one of the four major types of macromolecules that are essential for all known forms of life. Each nucleotide is composed of one of four nitrogen-containing nucleobases (cytosine [C], guanine [G], adenine [A] or thymine [T]). DNA sequencing is the process of determining the nucleic acid sequence. In this paper we worked with these sequences. 3.2. MySQL database Firstly we created a database in local host, followed by a database table. We split the sequences into 1680-character-long pieces and uploaded many different DNA sequences into the table. CREATE TABLE dna ( id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, seqPosition INT NOT NULL, sequence TEXT NOT NULL ); Only with this structure, a simple pattern searching SQL query took about 5-10 minutes in MySQL [9]. SELECT * FROM dna 161 where seq like "%GACCGCGGCGCCGAGCGGCAGCCGCGCCGGCCCGGAC%" and id=101; 1 row(s) returned 550.953 sec / 0.000 sec 3.3. PostgreSQL database We did the same in PostgreSQL. CREATE TABLE dna ( id SERIAL PRIMARY KEY NOT NULL, seqPosition INT NOT NULL, sequence TEXT NOT NULL ); However in PostgreSQL [10] a pattern searching query has never took more than 5 minutes. 3.4. Indexing A B-tree index can be used for column comparisons in expressions that use the =, >, >=, <, <=, or BETWEEN operators. The index also can be used for LIKE comparisons if the argument to LIKE is a constant string that does not start with a wildcard character. MySQL uses the Turbo Boyer-Moore [11] algorithm to initialize the pattern for the string and then uses this pattern to perform the search more quickly. We examined the query runtime after creating indexes for the table columns. As we mentioned before, a simple selection statement could took minutes to complete. After creating indexes in the database columns, the following statement returned within seconds. SELECT * FROM dna where pos=111; 10 row(s) returned 0.110 sec / 0.015 sec 4. K-mers In bioinformatics, k-mers are subsequences of length k contained within a biolog- ical sequence. Primarily used within the context of computational genomics and sequence analysis. Usually, the term k-mer refers to all of a sequence’s subsequences of length k, such that the sequence AGAT would have four monomers (A, G, A, and T), three 2-mers (AG, GA, AT), two 3-mers (AGA and GAT) and one 4-mer (AGAT). The maximum number of variations of the possible substrings are 4𝑘 . 4.1. Method We developed a program in python to determine the number of k-mers. With reading the file line by line it counts the occurring nucleotid variations. We examined human, hominoidea, plant and bacteria dna sequence too. 162 Figure 1: Human monomer Figure 2: Bonobo monomer The next diagram shows the difference in percentage between each of the genom 3-mers self-compared value of occurrence. 163 Figure 3: Comparison of a human genom and a bonobo genom 3-mer 4.2. Multiprocessing We had an idea to use multiprocessing to increase the program efficiency. The basic idea was that if we split the file into smaller pieces, (for example four equal sized files) and if we add each piece into a thread, it should reduce the runtime. In python, multiprocessing is a package that supports spawning processes using an API similar to the threading module. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows. The multiprocessing module also introduces APIs which do not have analogs in the threading module. A prime example of this is the Pool object which offers a convenient means of parallelizing the execution of a function across multiple input values, distributing the input data across processes (data parallelism). We get the following results without and with multiprocessing: The runtime with a 3gb sized dna sequence is more than 24 minutes (1440.026 sec) without parallelism. The runtime with the same, but divided file is less than 10 minutes (585.83 sec) with parallelism. This means a 59% performance improvement. For a better representation we tested our program on a 820MB sized DNA sequence. At first we did not use multiprocessing, then split the file into 3/6/11 equal pieces and used multiprocessing for our calculation. The following figure represents the taken time by each run. 164 Figure 4: Multiprocessing 5. Compression The Lempel–Ziv–Markov chain algorithm (LZMA) is an algorithm used to perform lossless data compression. This algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977. It is used by 7zip nowadays [12, 13]. We decided to use this algorithm to compress our data because it features a very high compression ratio. The algorithm creates a dictionary while encodes the data and the fact that we are working only four characters provides us a small dictionary. With this method a 3GB sized DNA sequence compressed size is only 831 MB which means more than 73% storage space saving. 6. Safety solutions In discussing security, it is necessary to consider fully protecting the entire server host (not just the server) against all types of applicable attacks: eavesdropping, altering, playback, and denial of service. Both MySQL and PostgreSQL have the option to create and add privileges with different roles to the database. Which provides the opportunity to determine which user could access the database and which could edit it. Every fresh MySql and PostgreSql system has an already defined role, which is always the superuser. 165 6.1. MySQL The privilege system is the most basic security feature MySQL has to offer. It allows defining who can access database. Each user may be assigned a number of privileges. They define what each user can do once they have been authorized to connect. The privileges can be either global, granting specific right without restrictions, or limited to a specific object such as database or table. MySQL implements an SSL library to provide connection encryption. All end- points – server and clients – require their own set of private keys and certificates to be able to use encryption. It offers a number of cryptographic [14] and hashing functions that can be used to either encrypt or decrypt data directly in queries – AES_ENCRYPT(), AES_DESCRYPT(), DES_ENCRYPT(), DES_DECRYPT(), SHA(), etc. 6.2. PostgreSQL Same as in MySQL, PostgreSQL also use user roles to manage access privileges. Roles can act as users, groups or both. Roles can own database objects and can manage the permissions on these objects. Both of the discussed database manage- ment systems the owner of a created object is the user who created it. PostgreSQL has native support for SSL connections to encrypt client/server communications for increased security. We would like to mention that it also has the possibility to encrypt data, both one way (e.g passwords) or two way(e.g credit cards). For one way encryption, the crypt function packaged in pgcrypto provides an added level of security above the md5 way. We have several ways to accomplish the two way encryption, these implements the encryption part of the OpenPGP(RFC 4880) standard. However we did not try these methods out. 7. Conclusion We created an environment where we can manage the user privileges and store the data securely and encrypted. The database manager gives us many options to easily add new users and create user hierarchy, from the principle of least privilege to a user with admin authority. We also managed to implement the first basic version of our processing tool that gives us the opportunity to compare different species genomes. Even started to test the multi-threaded version of the program for faster process of the genomes. 8. Future Work In the future we want to try new data storage options and compare our results with other database managers, e.g NoSQL. As our reviewers insisted it would be interesting to examine another compression algorithms, and compare them to our 166 results. Moreover, the processing optimization and performance improvement is in our main scope. Implementing a user friendly interface with options to import genomic data, which also capable of creating different graphs and reports based on the results is also one of our main goals. References [1] Lichtenwalter, R.N., Zorina-Lichtenwalter, K., Diatchenko, L. Genotypic data in relational databases: efficient storage and rapid retrieval, In: Kirikova, M., Norvag, K., Papadopoulos, G.A. (eds.) ADBIS 2017. LNCS vol. 10509 (2017), 408– 421. [2] Uwe Roehm, Jose Blakeley Data management for high-throughput genomics (2009), https://arxiv.org/abs/0909.1764 [3] Yamamoto M., Kita Y., Kashiwamura S., Kameda A., Ohuchi A. Develop- ment of DNA Relational Database and Data Manipulation Experiments, In: Mao C., Yokomori T. (eds) DNA Computing. DNA Lecture Notes in Computer Science vol. 4287 (2006), 418–427. [4] Gillis J.J.M., Van den Bussche J. A Formal Model for Databases in DNA. In: Horimoto K., Nakatsui M., Popov N. (eds) Algebraic and Numeric Biology. Lecture Notes in Computer Science vol. 6479 (2012), 18–37. [5] Aniceto, R., Xavier, R., Guimarães, V., Hondo, F., Holanda, M., Wal- ter, M.E., Lifschitz, S. Evaluating the Cassandra NoSQL database approach for genomic data persistency, Int. J. Genomics 2015, 1–7. [6] Rice, M., Gladstone, W., Weir, M., Relational databases: a transparent frame- work for encouraging biology students to think informatically, Cell Biol. Educ. 3(4) (2004), 241–252. [7] Bhargavan, K., Leurent, G., On the practical (in-) security of 64-bit block ci- phers: collision attacks on HTTP over TLS and OpenVPN, In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (2016), 456–467. [8] Bogdanov, A., Khovratovich, D., Rechberger, C. Biclique cryptanalysis of the full AES, In: Lee, D.H., Wang, X. (eds.) ASIACRYPT 2011. LNCS, vol. 7073 (2011), 344–371. [9] MySQL Documentation, https://dev.mysql.com/doc/ [10] PostgreSQL Documentation, https://www.postgresql.org/docs/ [11] Jorma Tarhio, Esko Ukkonen Approximate Boyer–Moore String Matching, SIAM Journal on Computing 22(2) (1993), 243–260. [12] Lempel–Ziv–Markov chain algorithm, https://en.wikipedia.org/wiki/Lempel- Ziv-Markov_chain_algorithm [13] A. D. Wyner and J. Ziv Fixed data base version of the Lempel-Ziv data com- pression algorithm. IEEE Transactions on Information Theory vol. 37(3) (1991), 878–880. [14] Qiu, Manying, and Steve Davis Database Security Mechanisms and Implemen- tation Issues in Information Systems vol. 3 (2002), 529–534. 167