Designing a DBMS Development Course with Automatic Assignment Evaluation Viacheslav Galaktionov1, 2 George Chernishev1, 2, 3 1 JetBrains Research 1 JetBrains Research, 2 Saint-Petersburg State University 2 National Research University Higher School of Economics, Saint-Petersburg, Russia 3 Saint-Petersburg State University, viacheslav.galaktionov@gmail.com Saint-Petersburg, Russia g.chernyshev@spbu.ru Abstract—Due to the constantly growing amount of data in the 2) Advanced courses, whose main task is to teach students world, we need better ways to process it. Conducting research to actually develop DBMSes. A lot of attention is paid and development in this area requires skilled workforce. Different to the internals of one or several classes of systems, as universities provide different courses to prepare people for this line of work. well as the most important algorithms. Student either In this paper we present our approach to conducting practice develop their own system from scratch or modify an sessions within a DBMS development course. We describe some existing one. of the approaches implemented by other universities, outlining Evidently, these two categories serve different purposes and their advantages and disadvantages. A popular approach is to provide students with a prototype of some DBMS and let them as such have to be taught differently. Let us concentrate on incrementally improve it by completing certain tasks. The two the advanced courses in this paper. A question arises: how most important problems in these courses are 1) choosing a should one organize such courses? Clearly, just giving lectures DBMS (an industrial or educational one), within which students to the student will not be enough because of the practical should work; 2) deciding whether to employ an automated testing nature of the covered topics. The students need to be given an system, and, if so, which one. In both cases we take a look at several options and justify the necessity to create a new opportunity to apply their new knowledge in order for them to one, which we then describe. In total, we have developed the fully understand the material. This means that special attention following: a base prototype of a row-store query executor, an should be given to practice sessions. In this paper we present automated testing system, a set of problems along with reference our approach to conducting practice sessions within a DBMS solutions and test cases. Finally, we present the results of a test development course. run involving 17 undergraduate students. Index Terms—education courses, education, databases, query The contribution of this paper is the following: engine, query processing, database internals 1) An overview of some of the approaches to conducting practice sessions within advanced database courses used I. INTRODUCTION in different universities. 2) The structure of our approach: the overall idea of the It is well-known that the amount of data that needs to course, the used DBMS prototype, the task set, approach be processed is increasing with an unprecedented speed [1]. to testing students’ solutions. This is mostly related to the emergence of such areas as 3) The results of the test run of our course, a description Big Data, the Internet of Things and cloud computing. The of our experience and the encountered issues. research of existing data storage and processing methods and This paper is structured as follows. In Section II we describe the development of new ones are thus becoming more and various DBMS development courses. Next, in Section III we more important. discuss overall architecture of our approach, and in Section IV Naturally, conducting said research and development re- we enumerate various security measures that we undertook. quires a great amount of highly-qualified workforce. Preparing The syllabus of our course and the idea of proposed tasks is such cadres is an important task that perhaps all universities described in Section V. The Section VI presents the outcome attempt to accomplish. of the first test run, justifies the benefits of our approach and There is a multitude of courses aimed at improving qualifi- describes the encountered issues. The future work and conclu- cations in subjects related to databases. They can be divided sion are presented in Sections VII and VIII, respectively. into two categories: II. RELATED WORK 1) Introductory courses, that explain a specific set of ba- sic terms. Usually such courses consider the classic Conducting advanced database courses is not a new prob- relational model and teach the students to apply it, lem, there are publications describing experience of many uni- but sometimes they can include information on various versities [2]–[6]. The referenced papers provide two different NoSQL systems. viewpoints on how such a course should be organised: 1) Students of the course described in [2] had to modify published by third parties, which makes it easy for students PostgreSQL, a DBMS used in the industry. However, to cheat. The latter, MinSQL, has never been made public. due to complexity of PostgreSQL’s architecture, only Therefore, we have decided to develop our own educational two of the tasks required students to actually modify DBMS. its code. Another important problem is grading students’ work. In the 2) On the other hand, the course described in [3] employed case of complex systems like DBMSes it becomes too difficult a DBMS developed specifically for it, SimpleDB. It was to assess the code by just reading it. Some form of automated made with code clarity in mind, sacrificing performance testing is required. Trusting the student to write their own tests where necessary. This allowed students, who were new is not an option, and handing out tests to students could lead to the subject, to find their bearings in the code and to them writing code that’s designed to pass the tests instead start modifying it. Because of this, the number of of actually solving the problems. programming-related tasks in this course was 9. A more correct and modern approach is to allow the A similar approach is being used at Harvard [4] right students to upload their code to some testing system, which now. There, students implement their own main-memory runs various tests on it. Besides testing for correctness and column-store. They cover such topics as indexing meth- performance, such a system can recognize different kinds of ods optimized for main-memory and shared scan meth- cheating: copying others’ solutions, DoS attack, etc. ods. During the class hours students discuss state of the This approach is very popular with programming contests art research papers. and online courses. In both cases there is a stream of solutions In Russia such courses also exist. In 2004 the South that is too large for a group of people to evaluate in a Ural State University offered a course “Parallel database reasonable time period. systems” [7], where students had to develop their own Of course, automated testing is used outside of these areas prototype of a parallel database management system as well. Code quality is an important characteristic of any using the MPI standard. In the Computer Science software, so it receives a lot of attention. There are industrial Center [5] course “Software engineering for big data” systems for automated code testing. students were offered to implement a distributed key- We have evaluated multiple systems from different areas. value data store and an application. The assignment Let us summarize our findings: encouraged team participation1 and there were 4 tasks 1) Programming contest platforms. We have considered overall. Several years ago Innopolis Univeristy also Yandex.Contest [12] and Codeforces [13], which are the offered [6] a three-week assignment for building a sim- most popular platforms in Russia. These systems are plified relational query engine in Python. This project capable of testing the code for correctness, performance, was aimed for team participation also. To the best of and are also protected against DoS attacks. However, our knowledge, both these courses involved manual they are expect the users to provide a single source checking of the solutions. file. This becomes a problem with complex systems like The second approach appears to be more desirable, as it is DBMSes. It is possible to put the entirety of source code both easier for the students and more saturated, i.e. instead of into one file, however, that is not something we want to simply reading about different algorithms and approaches, the teach our students. students will have to actually implement them. 2) Online course platforms. This area was represented in our research by Stepik [14] and Coursera [15]. They Another advantage of the second approach is that the exhibit the same problem as the programming contest students are given an opportunity to improve their skills platforms. However, they have additional disadvantages. related to systems programming as well as complex system For example, to the best of our knowledge, it is impos- development [8], [9]. sible to create a private Coursera course. Furthermore, Because of the aforementioned reasons we have decided to 3) Industrial testing systems. We have looked at Travis take the second approach. However, that raises the question: CI [16] and Jenkins [17]. While they provide exceptional which DBMS should we use? SimpleDB itself is hardly an testing capabilities, they are not well-suited to track option, since it puts a lot of attention on multi-user operation, students’ progress. which is not very interesting for us but complicates the code Due to the lack of a system that would meet all our somewhat. requirements, we have decided to implement our own. We have considered other systems as well: Minibase [10] and MinSQL [11]. The former provides a great set of features III. TESTING SYSTEM ARCHITECTURE and comes in two version: Microbase, which is freely available Our system is organized as a set of Docker containers. to everyone but has a heavily restricted feature set, and The system provides a web interface, where the students can Minibase, the full version, which is available only to teachers. submit their code for testing and the teacher can track their However, the source code for the full version has already been progress. The main parts of the system are as follows: 1) Web server. The system has two kinds of user accounts: 1 https://github.com/alesavin/csc-bdse for students and for teachers. The interface is structured Figure 1. Interaction between different parts of the system Figure 2. The solution checking process differently depending on who the user is. The student describe how it operates when a student submits a solution. will see a list of tasks, some of which will be marked as First of all, the web server will receive the code archive and completed if they have submitted a solution that passed pass it to the testing queue via its API. Then, the testing queue all tests successfully. The teacher can add and remove will save the archive on disk, and then add two records to the students from the course, make various changes to the database: one that describes the submission itself, and one that problem set, and track the students’ progress. describes the testing result. 2) Database. We use PostgreSQL to store the information The testing queue process contains a set of coroutines, each on students, tasks, and submissions. We also use it to of which periodically queries the database for submissions organize the testing queue, which will be described later. that need to be tested. If there is a solution that is ready to 3) Testing queue. This module acts as a mediator between be tested, the coroutine will get a lock on the corresponding the web server and the testing container. It is also database record, and start a testing container. This container responsible for storing the uploaded solutions on disk. will be limited in its resources such as the amount of RAM or The testing queue provides an HTTP API that is used execution time. These limits are configurable via the teacher’s by the web server whenever a student uploads a new web interface on a per-problem basis. solution or whenever someone wants to download a Once the testing container stops working, the coroutine will previously submitted solution. determine why it stopped and assign the submission one of 4) Testing container. Completely isolated from the rest the following grades: accepted, invalid answer, runtime error, of the network, this module encapsulates building and timeout, compilation error. After that both the grade and the running the code. The entrypoint of that container is a log are saved in the database and the testing queue starts script that will unpack and compile the code and then waiting for a new task. run our testing program. This program will run a set of queries and check their results, meanwhile providing a This process is depicted in Figure 2. Solid arrows represent log that can help the students understand how and why the path that a student’s code archive follows within our their code failed. However, this log is very concise and testing system when it is uploaded. Dash arrows represent the does not provide the student with all information about behaviour of a coroutine that is responsible for checking this errors in order to keep the test data secret. solution. The diagram of interaction between different parts of the system is presented in Figure 1. Note that only the web server IV. SECURITY MEASURES has access to the Internet. The testing system was written in the Go programming Any user facing system should be sufficiently protected language. The net/http and html/template packages from its from different kinds of attacks. Of course, our system is no standard library were used to implement the web interface exception. A lot of attention has been given to protecting the and the testing queue logic. system from malicious behaviour that can be exhibited by Now that we’ve covered the structure of our system, let us some students. Let us go over some of the precautions taken: 1) Our system works only over HTTPS. This helps us Table I prevent traffic-sniffing in order to steal passwords, for Project Details example. Type of work Lines of code Man-hours 2) Testing containers are limited in their resources: RAM, Testing system 2900 20 the number of PIDs, disk space. These limits help us Test cases — 30 Base prototype 1200 5 prevent fork-bombs and memory floods. Reference solution 800 15 3) By leveraging access rights we prevent the student’s Source code post-review — 6 code from being able to write into the log generated All components 4900 76 by the testing program, thus preventing data leaks. 4) To prevent data loss, backup copies of all essential files were made daily. froze the submission three days before the final exam to look through the code. V. SYLLABUS VI. COURSE TEST RUN The course was designed with undergraduate students in mind. Thus, it does not require any specific knowledge on This course has been conducted at the Higher School of the students’ part: they should be experienced programmers Economics in Saint Petersburg for a group of 17 students. with a basic understanding of the C++ programming language All of them were taught this course in the same manner, and be familiar with the UNIX environment. They should also there was no division into experimental and control groups. understand what a database management system, a relational The test run was approved by the administration of the St. data model, and SQL are. Petersburg School of Mathematics, Physics and Computer At the beginning of the course, all students are provided a Science. No personal information was ever retrieved, and prototype of a relational DBMS with minimal functionality: students’ behavior within the course was not matched to students’ university record or any other personal data. The • parsing queries written in a subset of SQL; testing system was deployed on the server provided to us by • a small set of physical operators: data source, filtering, the institution. and nested-loop join; • building a simple query plan that can have at most one A. Quantitative analysis join operation; First of all, let us estimate various metrics related to all • reading table data and printing query results in the CSV components of our system. They are presented in Table I. format. Here, we try to show the effort it took to develop each of The prototype is written in the C++ programming language. the components in lines of code (all involved programming Architecture-wise it is a row-store that follows the Volcano languages combined) and man-hours. The first line describes query processing model. It was provided to the students testing system itself, which was the largest in terms of lines through a GitHub repository [18]. of code. However, in terms of man-hours, the development of A total of 8 tasks have been prepared, each of them aimed at test cases and more importantly calculating their time limits expanding the prototype’s functionality. The main topics were were significantly more demanding. query optimization and execution with read-only workloads. The next three rows describe the properties of the prototype Along with the tasks we have written a reference solution that was handed out to students, our reference implementation, to each task, using the simplest approaches. This reference and an averaged solution. Please note that the reference solution was used during test generation to keep track of the solution was built on top of the base prototype. The last line expected execution time. describes our effort to conduct a source code post-review to At the end of the course students who have completed check for cheating and other possible problems. all tasks will have developed a DBMS with the following Our estimates (from previous years, where manual check- features: ing was performed) show that each task requires at least • block-oriented data processing; two attempts, where each attempt lasts about 10 minutes. • a larger set of physical operators: cross product, merge Therefore, the overall time required to check all problems join and hash join, projection, multiple implementations that are offered in our new course would be approximately of duplicate removal; 10 minutes ∗ 2 ∗ 8 ∗ 17 = 45 hour s. Each practice class session • an optimizer that can optimally select physical operators lasts 90 minutes (once a week) and there are ≈ 14 such and the order of joins given the available RAM amount; sessions in a semester, which gives us 21 hours in total. • a rewriter able to simplify the predicates and recognize Therefore, it would be impossible to check all these solutions the inconsistent ones. without involving additional reviewers. The students were supposed to work on their own. Weekly Moreover, this number of additional man-hours required to seminars were held so that they could consult the instructor prepare this course has been compensated by the following: regarding task formulations and any technical problems they 1) Flexibility. Students can check their solutions at any encountered. In order to check that there was no cheating, we arbitrary time and can work at their own pace. Further- more, they can attend class sessions only if they have addition to fixing the problems mentioned in Section VI, we questions. plan to do the following: 2) Quality. We have improved the quality of the course • Improve the user interface of our testing system, both in by introducing precooked test queries and answers. This terms of functionality and visual design. As an example removes the possibility of instructors forgetting to check of missing functionality, the system might benefit from some particular test cases. having an option to let the students ask questions so that 3) Reusability. The testing framework we have developed the teaching assistants could answer them. can be reused during the next iterations of the course. • Transfer the code submission process from an archive- 4) Scalability. Using an automated testing systems makes based one to a git-based one. We expect that to bring it significantly easier to increase the number of students multiple improvements. First of all, the process should in the future runs. become easier for students. Secondly, experience shows that many students will use git anyway. We have noticed B. Qualitative analysis that almost a half of our students have made their own The course has received good feedback from the students. publicly available forks of our GitHub repository. These However, we have found some problems in the way our course forks included their solutions to the problems, which is was organized: undesirable, as it makes it easy for future students to cheat. • We have simplified the code for tuples too much, so in • Add means to check if a solution was copied from order to make block-oriented processing beneficial, the some other student. This time we had to dedicate three students would have to rework a significant part of the days before the exam in order to manually check all system, especially since this task was given late in the submissions that were accepted by the testing system. course. • Make building and running the code separate isolated • Easy access to automated testing prevented students from stages. This would allow us to keep full compilation logs writing their own tests. This has reflected on the to- and show them to students in case their code fails to build, tal number of submissions, which has exceeded 1000. as well as measure the actual query processing time. Perhaps we should have provided students with a data The latter would in turn allow us to range the students generator instead of relying on them to find or develop based on their code’s performance, thus giving them an one. incentive to find better solutions. • Testing correctness by running a query and checking its • Provide students with tools to test their code locally: result can be excessive, especially when testing such a data generator; some simple logging facility; and a features as query rewriting. Furthermore, writing tests verbose version of our testing program, which would give becomes unnecessarily hard in such cases, since the the user detailed error messages without being afraid to only criterion for success is whether the queries can be leak test data. executed within a given time limit. A better approach would be to check the query plan directly. VIII. CONCLUSION • At the earlier stages of the course (before implement- In this paper we have described our experience of organiz- ing a full-blown query optimizer) query execution time ing practice lessons for a DBMS development course. We have depended heavily on the order of joins. This made it outlined some of the approaches that have been taken by others impossible for one of the students to pass the tests despite before us and outlined their advantages and disadvantages. having a perfectly working solution. In order to conduct the course we have developed a proto- • Due to the fact that each run of the students’ programs type of a simple DBMS and put together a set of problems for corresponded to running a single query, some students the students, along with a reference solution. To simplify the made questionable design choices. For example, two stu- testing process for all parties involved, we have developed a dents updated catalog information during query rewriting system for automated testing. instead of relying on temporary data structures. We also describe the problems we have encountered during • There are some things that cannot be tested in an auto- this course. Most of them have to do with the fact that this mated manner. The greatest example of that is student’s was the first iteration of this course and we lacked the time understanding of the code they have written. There was to plan and implement every desirable feature. an exam at the end of the course, however, it was However, we believe our approach is viable and therefore theoretical in nature and therefore there was no place for intend to continue conducting courses in this manner. There code-related questions. Conducting code-review sessions is much to improve in our course besides dealing with the during practice lessons would be greatly beneficial. aforementioned problems. Our plans for future work are also described in the paper. VII. FUTURE WORK REFERENCES This was just the first iteration of the course. We hope [1] A. Oussous, F.-Z. Benjelloun, A. A. Lahcen, and S. Belfkih, “Big data to improve our testing system and additional materials. In technologies: A survey,” Journal of King Saud University - Computer and Information Sciences, vol. 30, no. 4, pp. 431 – 448, 2018. line]. Available: http://cte.eltech.ru/ojs/index.php/kio/article/view/1320 [Online]. Available: http://www.sciencedirect.com/science/article/pii/ [9] C. Genzmer, V. Hudlet, H. Park, D. Schall, and P. Senellart, “The S1319157817300034 SIGMOD 2010 programming contest a distributed query engine,” [2] A. Ailamaki and J. M. Hellerstein, “Exposing undergraduate SIGMOD Record, vol. 39, no. 2, pp. 61–64, 2010. [Online]. Available: students to database system internals,” SIGMOD Rec., https://doi.org/10.1145/1893173.1893185 vol. 32, no. 3, pp. 18–20, Sep. 2003. [Online]. Available: [10] R. Ramakrishnan and J. Gehrke, Database Management Systems, 3rd ed. http://doi.acm.org/10.1145/945721.945725 New York, NY, USA: McGraw-Hill, Inc., 2003. [3] E. Sciore, “Simpledb: A simple java-based multiuser system for teaching [11] G. Swart, “MinSQL: A simple componentized database for the database internals,” 01 2007, pp. 561–565. classroom,” in Proceedings of the 2nd International Conference on [4] “Harvard CS165: Data Systems,” http://daslab.seas.harvard.edu/classes/ Principles and Practice of Programming in Java, ser. PPPJ ’03. New cs165/, [Online; accessed 06-February-2019]. York, NY, USA: Computer Science Press, Inc., 2003, pp. 129–132. [5] “Программная инженерия больших данных,” [Online]. Available: http://dl.acm.org/citation.cfm?id=957289.957328 https://compscicenter.ru/courses/big-data-software-engineering/2018- [12] “Yandex.Contest,” https://contest.yandex.ru/, [Online; accessed 06- spring/, [Online; accessed 06-February-2019]. February-2019]. [6] “Своя СУБД за 3 недели. Нужно всего лишь каждый день немного [13] “Codeforces,” https://codeforces.com/, [Online; accessed 06-February- времени...” https://habr.com/ru/post/347274/, [Online; accessed 06- 2019]. February-2019]. [7] М. Цымблер, Л. Соколинский, А. Лепихов, “Прототипирование [14] “Stepik,” https://stepik.org/, [Online; accessed 06-February-2019]. параллельной СУБД как основа учебного курса по параллельным [15] “Coursera,” https://www.coursera.org/, [Online; accessed 06-February- системам баз данных,” Суперкомпьютерные системы и их 2019]. применение, 2004, с. 212–217. [16] “Travis CI,” https://travis-ci.org/, [Online; accessed 06-February-2019]. [8] K. Smirnov and G. Chernishev, “ACM SIGMOD Programming Contest: [17] “Jenkins,” https://jenkins.io/, [Online; accessed 06-February-2019]. an opportunity to study distinguished aspects of database systems and [18] “ToyDBMS GitHub repository,” https://github.com/ software engineering,” Computer Tools in Education, no. 5, 2014. [On- chernishev/ToyDBMS, [Online; accessed 06-February-2019].