On the Interaction of Relational Database Access Technologies in Open Source Java Projects

Alexandre Decan*, Mathieu Goeminne*† and Tom Mens*
* Software Engineering Lab, University of Mons, Belgium. Email: {first.last}@umons.ac.be
† Center of Excellence in Information and Communication Technologies, Belgium. Email: mathieu.goeminne@cetic.be

Abstract

This article presents an empirical study of how the use of relational database access technologies in open source Java projects evolves over time. Our observations may help project managers make more informed decisions on which technologies to introduce into an existing project, and when. We selected 2,457 Java projects on GitHub using the low-level JDBC technology and higher-level object-relational mappings such as Hibernate XML configuration files and JPA annotations. At a coarse-grained level, we analysed the probability of introducing such technologies over time, as well as the likelihood that multiple technologies co-occur within the same project. At a fine-grained level, we analysed to which extent these different technologies are used within the same set of project files. We also explored how the introduction of a new database technology in a Java project impacts the use of existing ones. We observed that, contrary to what could have been expected, object-relational mapping technologies do not tend to replace existing ones but rather complement them.

Copyright © 2016 by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
In: A.H. Bagge, T. Mens (eds.): Postproceedings of SATToSE 2015 Seminar on Advanced Techniques and Tools for Software Evolution, University of Mons, Belgium, 6-8 July 2015, published at http://ceur-ws.org

1 Introduction

As software systems become more and more complex, the effort required for creating new systems and maintaining existing ones increases over time. This effort can be reduced by embedding code in reusable libraries that offer services for supporting a particular aspect of the developed system. For example, for software systems that strongly interact with a relational database, numerous technologies (libraries, APIs and frameworks) exist for connecting the program code to the database. Understanding how database technologies tend to replace or complement existing ones in software projects can help project managers choose the most appropriate technology, and the most appropriate moment for introducing it.

The program code can be connected to the database in various ways. In the simplest case, the code contains embedded database queries (e.g., SQL statements) that are interpreted by the database management system. In more complex cases, especially for object-oriented programs, object-relational mappings (ORM) are provided to translate program concepts (e.g., classes, methods and attributes) into database concepts (e.g., tables, columns and values), so that database elements can be created, read, updated or deleted (CRUD) directly by manipulating object-oriented views. Although ORMs abstract away from technical connection details in order to facilitate software development, some evolution-related problems remain.

The highly dynamic nature of current database access technologies makes it hard for a programmer to figure out which SQL queries will be executed at a given location of the program source code, or which source code methods actually access a given database table or column.
Conversely, the high level of abstraction provided by ORMs makes it hard to determine the impact on the program code of changes in the database schema. In addition, co-evolving the database and the program requires mastering multiple languages and technologies.

This paper examines how popular technologies are used in open source Java projects for connecting the source code to a relational database. To do so, we focus on three research questions:

RQ1: When and in which order are database technologies introduced in a project? We observe that they tend to be introduced very early in the project's lifetime. This is expected, since those technologies are typically central components of the projects in which they occur. We also observe that multiple database access technologies are used in many projects, and that they tend to be used simultaneously. Finally, we study which technologies tend to be complemented by other technologies.

RQ2: How does the introduction of a new technology in a project affect the already included ones? With this question we wish to understand whether technologies tend to replace existing ones, or rather complement them. In the former case, the introduction of a new technology would decrease the use of the already included technology. In the latter case, the new technology may serve as a catalyst, leading to an increased use of the already included technology.

RQ3: To which extent does the introduction of a new technology impact the way in which a project accesses the database? This question focuses on the evolution of project files that use a particular technology, after a new database technology has been introduced in the project: are these files modified in order to benefit from the newly introduced technology? For certain pairs of technologies, we found this to be the case. For most pairs of technologies, however, existing database-related files do not substantially adopt the latest introduced technology.

The remainder of this paper is structured as follows. Section 2 presents attempts to methodically analyse and compare similar technologies that can be found in the scientific literature, and puts our research in perspective. Section 3 presents the approach we followed for collecting the data required for our empirical study, as well as the methodology for analysing it. The next three sections address our research questions. Section 7 discusses the threats to validity of our study. Section 8 discusses possible extensions of the presented study, and Section 9 concludes.

2 State of the Art

While the literature on database schema evolution is very large [1], few authors have proposed approaches to systematically observe how developers cope with database evolution in practice. Sjoberg [2] presented a study where the database schema evolution of a large-scale medical application is measured and interpreted. Vassiliadis et al. [3] studied the evolution of individual database tables over time in eight different software systems.

Several researchers have tried to identify, extract and analyse database usage in application programs. The purpose of the proposed approaches ranges from error checking [4, 5, 6], over SQL fault localisation [7], to fault diagnosis [8]. More recently, Linares-Vasquez et al. [9] studied how developers document database usage in source code. Their results show that a large proportion of database-accessing methods are completely undocumented.

Several empirical studies have analysed the evolution of library and technology usage. Bauer and Heinemann [10] were able to identify distinct evolution scenarios for API dependencies in software projects. The gained knowledge may be useful for evaluating opportunities in API migration and evolution. Teyton et al. [11] identified sets of similar libraries in a large corpus of software projects. The obtained results can be used for suggesting alternative libraries to project managers who want to migrate from one library to another. In [12] they investigate how and why library migrations occur. They found that library migrations are relatively rare, and that projects that have witnessed more than one migration are exceptional. They also observed that migration is generally an atomic change performed by a single developer in a single commit.

3 Methodology and Data Extraction

The empirical study in this paper focuses on open source Java systems. Java is among the most popular programming languages today, and a large number of technologies and frameworks are available to facilitate relational database access from within Java code. The choice for open source systems is motivated by the accessibility of the entire history of the source code in freely accessible version control repositories.

3.1 Considered Database Access Technologies

In previous work [13, 14], we considered 26 Java relational database technologies that offer a direct means of accessing a relational database and whose presence in a project is identifiable through static analysis. By analysing the import statements in Java files as well as the presence of specific configuration files, we determined the presence of each of these technologies. We performed a survival analysis of the technologies used in order to determine their relative importance over time in the considered projects.

This paper provides a more in-depth study, by looking at the interaction between object-oriented source code and relational databases at a more fine-grained level. We have selected three popular technologies, each representative of a particular way to connect the source code to a database (embedded SQL, external mapping files, and Java annotations):

JDBC

jdbc (see oracle.com/technetwork/java/javase/jdbc/) is a low-level technology for connecting Java programs to a database by sending SQL queries directly from within the source code. While version 1.1 was released in 1997, there have been regular version upgrades to cope with the evolution of the Java language. This technology is still intensively used in numerous projects [13], despite the inherently close coupling that is required between the source code and the database schema.

In our study we consider this technology to be associated to a Java source code file if entities belonging to java.sql are imported in this file.

Hibernate

ORM technologies rely on a mapping description for associating (object-oriented) source code elements to database elements. They aim to reduce the so-called object-relational impedance mismatch [15]. The mapping description can take the form of configuration files, placed alongside source code files, to express the relations between the considered entities. Hibernate is a popular open source Java framework adopting this solution. It was first released in 2001, and provides an abstraction layer on top of jdbc. Hibernate has been criticised by many for not being a 100% transparent data persistence solution.

In our study we analyse Hibernate (see hibernate.org/) XML configuration files (denoted by hbm hereafter), and consider that a Java file relies on Hibernate technology if at least one Hibernate configuration file mentions the Java file as a code entity resource.

JPA

Annotation-based mapping descriptions offer an increasingly popular means to express the relations required by ORM engines. With such mappings, Java annotations are used to mark program elements as counterparts of database entities. The Java Persistence API (see oracle.com/technetwork/java/javaee/tech/persistence-jsp-140049.html; denoted by jpa hereafter) is the de facto Java standard for annotation-based mappings. jpa was first released in 2006, and relies on the Java annotation mechanism that was first introduced in Java 5. We consider this technology as representative for this kind of mapping description.

In our study we consider that a Java file relates to jpa if the Entity, Embeddable, or MappedSuperclass annotations from package javax.persistence can be found in this file.

Discussion

As witnessed by many discussions on Stack Overflow (see for example stackoverflow.com/questions/Q with Q = 1607819, 2397016, 2560500 or 530215), there is no consensus on which of these three technologies is the most appropriate for any given project, as it may depend on many project-related characteristics, technological choices or even personal preferences.

One should also note that the use of these technologies is not exclusive. A project may use all of these technologies simultaneously. These technologies may even be used together within the same Java source code files.

3.2 Selected Projects

In order to obtain a representative project sample, we based our empirical analyses on Java projects belonging to the GitHub project corpus proposed by Allamanis and Sutton [16]. Among these projects, 13,307 still had an available Git repository on 24 March 2015.

In order to carry out our empirical study, we selected 2,457 projects from this project corpus for which at least one of the commits contained a reference to either jdbc, jpa or hbm. For each selected project, we extracted the existing relations between source code and database entities from the first commit of each week, and we obtained a historical view of all the files that can be related to a particular technology or to a particular framework.

                        mean   stdev   median       max
duration (in weeks)       76     121       23       812
# commits               1317    6013      126   174,618
# contributors            12      31        4      1091
# files in HEAD         1058    3549      213   103,493
# Java files in HEAD     512    1793       88    46,661

Table 1: Characteristics of the selected projects. HEAD refers to the latest extracted version.

Table 1 shows some of the characteristics of the selected projects. The distribution of metrics values is highly skewed, suggesting evidence of a Pareto principle [17]. The duration is expressed in weeks between the first and the last commit.

Figure 1: Number of projects per considered technology.

Figure 1 reports the number of projects per considered technology, taking the entire lifetime of each project into account. We observe that the project sample is relatively unbalanced with respect to the presence of each technology, but each pair of technologies is still represented in quite a number of projects.

4 RQ1 When and in which order are database technologies introduced in a project?

Introducing a new technology in a software project comes with a certain cost. A common policy is therefore to introduce such a technology only if the expected benefits outweigh the expected cost.

For each project, we analysed at what moment in the project's lifetime each considered technology got introduced. The answer appears to depend on the duration of the considered projects. To minimise the effect of project duration, we normalised the lifetime of each project into a range between 0 (the start of the project) and 1 (the last considered commit).

Figure 2: Violin plot (using a kernel density estimate) of the distribution of the introduction time of a technology in the Java project corpus.

Figure 2 compares, for each considered technology, two distributions of the introduction time of the technology in a project. The first distribution (left) considers the first time a technology gets introduced in a project. The second distribution (right) considers the introduction of the technology in a project that already had a technology before. As expected, we observe that more than 50% of the introductions of a first technology are done in the first 10% of the project's lifetime. For technologies introduced after an existing one, the distribution tends to be flatter.

We also observe that the two distributions for jdbc differ less than the ones related to jpa or hbm. To verify this, we performed a Kolmogorov-Smirnov statistical test for each pair of distributions related to jdbc, jpa and hbm. The tests show that the two distributions associated to each technology are significantly different (p-values are lower than $10^{-6}$). This may indicate that for jdbc, the moment of introduction is less affected by the presence of another technology than for hbm and jpa.

We saw that the time at which a technology is introduced in a project varies depending on the presence of another technology in this project. Which technologies are more likely to be succeeded by another one?

To answer this question, we use the statistical technique of survival analysis to estimate the probability that a technology does not remain the last introduced one in a project's lifetime. Survival analysis [18] creates a model estimating the survival rate of a population over time, considering the fact that some elements of the population may leave the study, and that for some other elements the event of interest does not occur during the observation period. In our case, the observed event is the introduction in a project of another technology after an existing one.

Figure 3: Probability that a technology remains the last introduced technology over time.

Figure 3 shows the survival rates for each considered technology. We observe that hbm has a much lower survival rate (i.e., a lower probability of staying the last introduced technology for a long time) than the other technologies. We also observe that, during the first 10% of the projects' lifetime, the survival rates of hbm decrease by 30%, a larger decrease than for the other two technologies. This implies that hbm is usually quickly replaced or complemented by another technology.

Figure 1 showed that around 23% of the projects use two or more database technologies in their lifetime, but these are not necessarily used simultaneously. We therefore identified which combinations of technologies actually co-occur in the selected Java projects. Frequent co-occurrences would reveal which technologies are complementary, and which technologies are used as supporting technologies of other ones. For each pair of technologies, we counted the number of projects in which these technologies actually co-occur, and in which order they were introduced in these projects. The results are summarised in Table 2.

(A, B)               (jdbc, jpa)   (jdbc, hbm)   (jpa, hbm)
# projects                   497           152           84
# co-occurrences             488           148           77
% co-occurrences           98.2%         97.4%        91.7%
start_A < start_B            157            50           19
start_A > start_B            151            27           37
start_A = start_B            189            75           28

Table 2: Project characteristics by pairs (A, B) of co-occurring technologies.

Among all projects that use multiple technologies during their lifetime we observe a very high proportion of co-occurring technologies. More specifically, in 97.3% (488+148+77 out of 497+152+84) of all the situations in which two distinct technologies were used during a project's lifetime, they were used simultaneously. Around 41% (189+75+28 out of 488+148+77) of all pairs of co-occurring technologies were introduced simultaneously ($start_A = start_B$), implying that around 59% of all pairs of co-occurring technologies concern projects in which the technologies were introduced at different moments ($start_A \neq start_B$).

Considering the number of projects in which the introduction of a technology A was observed before the use of a technology B, it seems that jpa tends to succeed hbm more often than the contrary (37 versus 19 observations). Similarly, hbm tends to succeed jdbc more often than the contrary (50 versus 27 observations). We did not identify such an order for jpa and jdbc (151 versus 157 observations).

Summary. All considered technologies are introduced early in the projects' lifetimes, even for projects that already use another technology. The number of projects in which multiple technologies co-occur is proportionally high. The order in which these technologies are introduced suggests that hbm is often succeeded by jdbc or jpa.

5 RQ2 How does the introduction of a new technology in a project affect the already included ones?

As multiple database access technologies are used in many projects, either simultaneously or one after the other, it is useful to study how the introduction of a new technology can impact the use of an already included one. This impact, if it occurs, could result in an increased or decreased usage of the already included technology. We therefore identified and counted for which projects the introduction of a new technology causes an increasing use of the older technology, a decreasing use, or no observable change in the use of the already included technology.

To quantify the impact, we rely on the first derivative of the number of files related to an existing technology. We computed and compared the mean of this derivative for two 8-week periods: the first period strictly precedes the moment of introduction of the new technology, and the second period immediately follows the moment of introduction.

In the following, we will use the term variation to denote the difference between the mean of the second period and the mean of the first period. The variation of a technology is easy to interpret: a positive value indicates an increasing use of the existing technology, while a negative value indicates a decreasing use of the existing technology.

Figure 4: Impact of the introduction of a new technology on the activity of an already included technology.

Figure 4 shows the distribution of the variation for each pair of technologies. We observe that jdbc and hbm cause a slight positive impact on the use of existing technologies (since the variation tends to be positive in 75% of all cases). Notice the large variation induced by introducing hbm in projects using jpa. The converse is not true: introducing jpa in a project that already uses hbm implies a negative variation for hbm.

Figure 4 only identifies global trends in our project corpus. It does not allow us to identify trends within individual projects. Figure 5 therefore distinguishes the projects that exhibit a positive variation (blue curve), a negative variation (red curve) or no variation (green curve) for several time intervals after the introduction of the new technology.

Figure 5: Number of projects with an increasing, decreasing or stable activity of an already included technology, as observed x weeks after introducing another technology.

Regardless of the considered pair of technologies, with the notable exception of the pairs (jpa after hbm) and (hbm after jpa), both the number of projects having no variation and the number of projects having a positive variation are systematically greater than the number of projects exhibiting a negative variation.

Figure 6 shows survival curves, using a Kaplan-Meier estimator, of the probability that a project keeps more than a threshold of 25% of its files related to an already included technology after the introduction of a new one. We tried different threshold values and they all lead to the same conclusions.

Figure 6: Probability that at least 25% of files related to a technology remain after the introduction of another technology.

Again, we observe that the most distinct behaviours are exhibited by jpa and hbm: the probability of keeping more than 25% of files related to hbm drops below 0.55 about 20 weeks after introducing jpa, while the probability for jpa files drops to a little more than 0.6 about 19 weeks after introducing hbm. This analysis corroborates our previous observations: introducing jpa or hbm does not negatively impact the use of jdbc, and conversely. We also observe from Figures 5 and 6 that most of the impact happens in the first weeks after introducing the new technology.

Summary. Introducing a new technology generally induces, in the short term, an increase of the presence of the already included technology, with the notable exception of the introduction of jpa in a project that already makes use of hbm. This suggests that, contrary to the promises of ORM technologies, new technologies do not tend to replace existing ones but rather complement them.

6 RQ3 To which extent does the introduction of a technology impact the way in which a project accesses the database?

From the results of RQ1 we observed that, if a project uses multiple database access technologies over its lifetime, these technologies tend to co-occur. At a more fine-grained level, we are interested in the impact of the introduction of a technology on the files that already relate to a previously used technology.

6.1 Do different technologies co-occur at file level?

Let us first study the co-occurrences of different technologies at file level without taking the evolutionary aspect into account. Figure 7 shows, for each pair of technologies, the distribution across projects of the ratio between the number of files that relate to each, or both, technologies, and the number of files that relate to any of these technologies. For each pair of technologies, only projects in which both technologies have been used at some point in their lifetime have been retained as elements of the distribution.

Figure 7: Relative number of files relating to pairs of technologies.

It turns out that pairs of technologies including jdbc present similar profiles: most projects contain a small proportion of files using both technologies. A two-sided Kolmogorov-Smirnov test confirms this similarity between distributions: we cannot reject the null hypothesis that states that the distributions associated to the proportion of files using a single technology are identical (p = 0.877 and 0.287, respectively). We conclude that jdbc is generally not used in the same files as jpa and hbm.

The pair of technologies jpa and hbm presents a different behaviour. The three distributions of the proportion of files that only relate to these technologies are significantly different (we reject the null hypothesis with p < 0.001). This result, combined with the form of the distributions, suggests that, for projects having used jpa and hbm, a file is likely to relate either to jpa only or to both jpa and hbm. In addition, the proportion of files that use both hbm and jpa is larger than for the other considered pairs of technologies.

Summary. There is a clear separation between files using jdbc and files using the two other technologies. For the combination of hbm and jpa, a partial, asymmetric overlap exists at file level: hbm is often used in the same files as jpa, while jpa is rarely used in combination with another technology in the same file.

6.2 How does the co-occurrence of technologies at file level evolve over time?

Let us now look at the same question from an evolutionary point of view, by assessing the impact, at file level, of introducing a new technology in a project that already uses another technology to access the database. To do this, we study how the files related to an existing technology get changed after introduction of the new technology.

Let us associate a migration profile to each project at different points in time after the introduction of the new technology. This migration profile reflects how the files related to the old technology are impacted. It is computed as follows:

Let $P$ be a project and $T = \{\text{jdbc}, \text{hbm}, \text{jpa}\}$ the considered technologies. For each point in time $t$ for $P$ and each technology $A \in T$ we define $\mathit{related}_P(A, t)$ as the (possibly empty) subset of (fully qualified) filenames of $P$ in which technology $A$ was detected at time $t$. For every pair of distinct technologies $(A, B) \in T \times T$, we write $M = (P, A, B)$ if $P$ is a project in which technology $B$ gets introduced while a technology $A$ is already in use. Let $t_M$ denote the point in time of this introduction and $F_M = \mathit{related}_P(A, t_M)$ the set of filenames associated to technology $A$. For each $t \geq t_M$ we associate to each $f \in F_M$ a label in $\mathcal{L} = \{\text{residual}, \text{removed}, \text{complemented}, \text{replaced}\}$ as follows:

residual if $f \in \mathit{related}_P(A, t) \setminus \mathit{related}_P(B, t)$
removed if $f \notin \mathit{related}_P(A, t) \cup \mathit{related}_P(B, t)$
complemented if $f \in \mathit{related}_P(A, t) \cap \mathit{related}_P(B, t)$
replaced if $f \in \mathit{related}_P(B, t) \setminus \mathit{related}_P(A, t)$

Given $M$, we also associate to each $t \geq t_M$ a set of labels $mp_M(t) \subseteq \mathcal{L}$. A label $L \in \mathcal{L}$ belongs to $mp_M(t)$ if, among the labels associated to each $f \in F_M$ at time $t$, no other label occurs more frequently than $L$.

Finally, the migration profile of $M$ at time $t$ is a unique label from $mp_M(t)$ selected based on the total order replaced > complemented > removed > residual. This total order privileges migration profiles that correspond to the adoption of the new technology.

As the choice of a total order could have altered the results of our analysis, we compared the results obtained with several total orders, and we observed only slight local variations. This is not surprising as there are only 72 pairs $(M, t)$ such that $|mp_M(t)| > 1$, representing 1.78% of all the considered pairs.

Figure 8 shows the evolution of the proportion of projects with a given migration profile. For the sake of readability, we only present results for complemented, replaced, and removed. The results for residual can be deduced from these, by taking the complement of complemented, replaced and removed.

Figure 8: Proportion (stacked) of projects for each migration profile. The complement corresponds to replaced.

We observe that, for each considered pair of technologies, and for each time delay (expressed in weeks) after the introduction of the new technology, most projects relate to the residual migration profile, implying that projects tend not to adapt their existing database access files to make use of the newly introduced technology. This is especially true for projects introducing jdbc after jpa or hbm.

The second dominant migration profile is removed. Regardless of the considered pair of technologies, more and more projects are associated to this migration profile. Over time, an increasing number of projects tend to reduce the number of files relating to the first considered technology. The predominance of residual and removed migration profiles suggests that, in many cases, files that related to the existing technology are not prone to use the newly introduced technology. Instead, they either continue to use the first technology or they tend to lose any relation to database access management.

The two other migration profiles, complemented and replaced, indicate an effective file migration from the existing technology to the newly introduced one. Such cases appear to be much less represented in our corpus, with the exception of projects in which jpa or jdbc is introduced after hbm. This is especially the case when jpa is introduced in a project using hbm: the files that were related to hbm become (sometimes exclusively) related to jpa.

Summary. Different technologies generally do not tend to co-occur in the same set of files, except, to some extent, when jpa and hbm are used together. We do not observe a true migration in technology usage: files that are related to a given technology do not tend to adopt the newly introduced technology, except for projects that migrate from hbm to another technology.

7 Threats to validity

Our research suffers from the same threats as other research relying on Git and GitHub [19, 20]. The selected Java projects potentially suffer from the same generalisability constraints as in [16]. The open source GitHub Java project corpus was curated to exclude low-quality projects (by ignoring projects that were never forked) and project duplicates.

While our corpus contained 2,457 projects, the number of projects involved in some pairs of database technologies was sometimes much lower. For example, only 19 projects were concerned by a migration from jpa to hbm (cf. Table 2). The accuracy of our observations could be increased by using a larger project corpus.

The detection of a technology is based on the static analysis of code and project-specific artefacts (e.g., Java annotations, import statements and XML files). This approach can lead to false positives: the presence of these artefacts does not necessarily reflect the actual use of the related technology.

Some of our analyses are based on arbitrarily chosen thresholds and on weekly time intervals. Because our results may depend on these thresholds and intervals, we repeated our experiments with different parameters but did not observe any major differences.

8 Future Work

The results presented in this article, possibly combined with more traditional project quality metrics, could be integrated in a managerial dashboard. Such a dashboard could be used to compare the characteristics and the evolution of a particular project against those belonging to the analysed project corpus. This would support project managers in evaluating and exploiting the expected benefits and disadvantages of introducing a new technology, as well as in assessing how this technology will come to be used in the project over time. Any ensuing managerial decisions will obviously depend on project-specific rules and guidelines that can hardly be generalised.

This paper used static analysis techniques to detect the presence of a particular technology. Using dynamic analysis techniques could reveal how database technologies are actually used in running systems. The analysis of queries submitted to the database at runtime could be used for understanding to which extent ORM technologies hide complexity from developers.

This paper focused on relational database access technologies based on three representative technologies (jdbc, Hibernate and jpa). It could be useful to include other Java specifications for object persistence as well, such as JDO.
It would also be useful to consider other kinds of databases (such as NoSQL, graph or object-oriented databases), since these are becoming increasingly popular. A follow-up study could take into account such alternative database technologies.

Other technological domains (beyond databases) could be considered as well. Event loggers, graphical user interfaces, and unit tests are examples of features supported by multiple concurrent technologies. Since the identification of the technology used in project files is the only part of our methodology that depends on the considered technologies, our approach could be easily adapted to study other technologies.

Section 7 mentioned the limitations of the selected project corpus. We therefore intend to confirm our research results by considering a larger project corpus, including both open and closed source projects. We also intend to study the effect of project quality and project maturity on the obtained results. Finally, we intend to include programming languages other than Java in the project corpus in order to avoid any bias introduced by language-specific characteristics.

While this paper only focused on technical aspects of connecting source code to databases, we plan to study the social aspects of systems involving such a database connection. More precisely, we would like to determine if the different technologies are introduced and managed by different teams or persons. Inspired by [21] we also aim to analyse the developer characteristics in order to determine how these affect the take-up, use, evolution and migration of technologies. Some examples of developer characteristics are their degree of specialisation, diversity, seniority, skills, and workload.

Finally, we plan to analyse software systems in order to automatically identify library features used in the source code, as well as feature similarities between different technologies. In situations where developers want to migrate from a given technology to another, such a feature identification and mapping is a first step towards better support for assisted or automatic migration [22].

9 Conclusions

Through static analysis of Java source code we carried out a large-scale empirical study to understand how database access technologies interact with one another. We considered three popular technologies (jdbc, hbm and jpa) that represent different means to connect Java source code files to a relational database. We selected data from 2,457 open source projects on GitHub that used at least one of the considered technologies.

Our study revealed common behaviours in the use of these three technologies. In spite of the promises of ORM technologies, we found no evidence that the low-level jdbc solution is massively replaced by hbm or jpa. The only significant technology migration we observed concerns the transition from hbm to jpa. More specifically, we summarise our main observations below.

We analysed the evolution and co-occurrences of the technologies in order to get a high-level view of their usage in the considered Java projects. It appears that, most of the time, database technologies are introduced early in the projects’ lifetime, whether they are the first technology introduced or not. Once introduced in a project, hbm tends to be complemented or replaced by another technology more frequently and more quickly than jpa and jdbc.

We also analysed how the technologies are used in the source code files. The introduction of jdbc and hbm tends to be followed by an increasing use of the already present database technology. This increase is particularly important when hbm is introduced after jpa. Conversely, the introduction of jpa reduces the use of hbm. jpa therefore appears to replace existing hbm in the database-related source code files, while the converse is not true.

Furthermore, jdbc generally does not share source code files with the two other considered database technologies. While jpa is used in isolation in a majority of source code files, hbm tends to be used more often in conjunction with jpa. The study of the evolution of such co-occurrence reveals that a file migration from one technology to another is only observed from hbm to jpa. In most projects, the introduction of a new database technology is not followed by a massive adoption of this technology by the existing database-related files, until these files become database-unrelated or are removed from the source code repository.

Exploiting all these results in a dashboard that supports managers in making project-specific decisions with respect to the introduction, use or evolution of database access technologies remains part of future work.

Acknowledgment

This research was conducted as part of the FRFC research project T.0022.13 “Data-Intensive Software System Evolution” that was financed by the F.R.S.-FNRS, Belgium.

References

[1] E. Rahm and P. A. Bernstein, “An online bibliography on schema evolution,” SIGMOD Rec., vol. 35, no. 4, pp. 30–31, Dec. 2006.

[2] D. Sjoberg, “Quantifying schema evolution,” Information and Software Technology, vol. 35, no. 1, pp. 35–44, 1993.

[3] P. Vassiliadis, A. V. Zarras, and I. Skoulis, “How is life for a table in an evolving relational schema? Birth, death and everything in between,” in Int’l Conf. Conceptual Modeling (ER), 2015, pp. 453–466.

[4] A. S. Christensen, A. Møller, and M. I. Schwartzbach, “Precise analysis of string expressions,” in Int’l Conf. Static Analysis (SAS), 2003, pp. 1–18.

[5] C. Gould, Z. Su, and P. Devanbu, “Static checking of dynamically generated queries in database applications,” in Int’l Conf. Software Engineering. IEEE Comp. Soc., 2004, pp. 645–654.

[6] M. Sonoda, T. Matsuda, D. Koizumi, and S. Hirasawa, “On automatic detection of SQL injection attacks by the feature extraction of the single character,” in Int’l Conf. Security of Information and Networks (SIN), 2011, pp. 81–86.

[7] S. R. Clark, J. Cobb, G. M. Kapfhammer, J. A. Jones, and M. J. Harrold, “Localizing SQL faults in database applications,” in Int’l Conf. Automated Software Engineering (ASE), 2011, pp. 213–222.

[8] M. A. Javid and S. M. Embury, “Diagnosing faults in embedded queries in database applications,” in EDBT/ICDT’12 Workshops, 2012, pp. 239–244.

[9] M. Linares-Vasquez, B. Li, C. Vendome, and D. Poshyvanyk, “How do developers document database usages in source code?” in Int’l Conf. Automated Software Engineering (ASE), 2015.

[10] V. Bauer and L. Heinemann, “Understanding API usage to support informed decision making in software maintenance,” in European Conf. Software Maintenance and Reengineering, 2012, pp. 435–440.

[11] C. Teyton, J. Falleri, and X. Blanc, “Mining library migration graphs,” in Working Conf. Reverse Engineering, 2012, pp. 289–298.

[12] C. Teyton, J. Falleri, M. Palyart, and X. Blanc, “A study of library migrations in Java,” Journal of Software: Evolution and Process, vol. 26, no. 11, pp. 1030–1052, 2014.

[13] M. Goeminne and T. Mens, “Towards a survival analysis of database framework usage in Java projects,” in Int’l Conf. Software Maintenance and Evolution, 2015.

[14] M. Goeminne, A. Decan, and T. Mens, “Co-evolving code-related and database-related changes in a data-intensive software system,” in CSMR-WCRE Software Evolution Week, 2014, pp. 353–357.

[15] M. N. C. Ireland, D. Bowers, and K. Waugh, “A classification of object-relational impedance mismatch,” in Int’l Conf. Advances in Databases, Knowledge, and Data Applications (DBKDA), 2009, pp. 36–43.

[16] M. Allamanis and C. Sutton, “Mining source code repositories at massive scale using language modeling,” in Int’l Conf. Mining Software Repositories. IEEE, 2013, pp. 207–216.

[17] M. Goeminne and T. Mens, “Evidence for the Pareto principle in open source software activity,” in Workshop on Software Quality and Maintainability (SQM), ser. CEUR Workshop Proceedings, vol. 701. CEUR-WS.org, 2011, pp. 74–82.

[18] I. Samoladas, L. Angelis, and I. Stamelos, “Survival analysis on the duration of open source projects,” Information & Software Technology, vol. 52, no. 9, pp. 902–922, 2010.

[19] C. Bird, P. C. Rigby, E. T. Barr, D. J. Hamilton, D. M. Germán, and P. T. Devanbu, “The promises and perils of mining Git,” in Int’l Conf. Mining Software Repositories, 2009, pp. 1–10.

[20] E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. M. Germán, and D. Damian, “The promises and perils of mining GitHub,” in Int’l Conf. Mining Software Repositories, 2014, pp. 92–101.

[21] B. Vasilescu, A. Serebrenik, M. Goeminne, and T. Mens, “On the variation and specialisation of workload: A case study of the Gnome ecosystem community,” J. Empirical Software Engineering, pp. 1–54, 2013.

[22] C. Teyton, J.-R. Falleri, and X. Blanc, “Automatic discovery of function mappings between similar libraries,” in Working Conf. Reverse Engineering, Oct 2013, pp. 192–201.