=Paper=
{{Paper
|id=Vol-1820/paper-03
|storemode=property
|title=On the Interaction of Relational Database Access Technologies in Open Source Java Projects
|pdfUrl=https://ceur-ws.org/Vol-1820/paper-03.pdf
|volume=Vol-1820
|authors=Alexandre Decan,Mathieu Goeminne,Tom Mens
|dblpUrl=https://dblp.org/rec/conf/sattose/DecanGM15
}}
==On the Interaction of Relational Database Access Technologies in Open Source Java Projects==
Alexandre Decan*, Mathieu Goeminne*† and Tom Mens*
* Software Engineering Lab, University of Mons, Belgium
Email: {first.last}@umons.ac.be
† Center of Excellence in Information and Communication Technologies, Belgium
Email: mathieu.goeminne@cetic.be
Abstract

This article presents an empirical study of how the use of relational database access technologies in open source Java projects evolves over time. Our observations may be useful to project managers to make more informed decisions on which technologies to introduce into an existing project and when. We selected 2,457 Java projects on GitHub using the low-level JDBC technology and higher-level object relational mappings such as Hibernate XML configuration files and JPA annotations. At a coarse-grained level, we analysed the probability of introducing such technologies over time, as well as the likelihood that multiple technologies co-occur within the same project. At a fine-grained level, we analysed to which extent these different technologies are used within the same set of project files. We also explored how the introduction of a new database technology in a Java project impacts the use of existing ones. We observed that, contrary to what could have been expected, object-relational mapping technologies do not tend to replace existing ones but rather complement them.

Copyright © 2016 by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
In: A.H. Bagge, T. Mens (eds.): Postproceedings of SATToSE 2015 Seminar on Advanced Techniques and Tools for Software Evolution, University of Mons, Belgium, 6-8 July 2015, published at http://ceur-ws.org

1 Introduction

As software systems become more and more complex, the effort required for creating new systems and maintaining existing ones increases over time. This effort can be reduced by embedding code in reusable libraries that offer services for supporting a particular aspect of the developed system. For example, for software systems that strongly interact with a relational database, numerous technologies (libraries, APIs and frameworks) exist for connecting the program code to the database. Understanding how database technologies tend to replace or complement existing ones in software projects can help project managers in choosing the most appropriate technology, and the most appropriate moment of introducing this technology.

The program code can be connected to the database in various ways. In the simplest case, the code will contain embedded database queries (e.g., SQL statements) that will be interpreted by the database management system. In more complex cases, especially for object-oriented programs, object-relational mappings (ORM) will be provided to translate program concepts (e.g., classes, methods and attributes) into database concepts (e.g., tables, columns and values), so that database elements can be created, read, updated or deleted (CRUD) directly by manipulating object-oriented views. Despite the fact that ORMs abstract away from technical connection details in order to facilitate software development, some evolution-related problems remain.

The highly dynamic nature of current database access technologies makes it hard for a programmer to figure out which SQL queries will be executed at a given location of the program source code, or which source code methods actually access a given database table or column. Conversely, the high level of abstraction provided by the ORMs makes it hard to determine the impact on the program code of changes in the database schema. In addition, co-evolving the database and the program requires mastering multiple languages and technologies.

This paper examines how popular technologies are used in open source Java projects for connecting the source code to a relational database. To do so, we focus on three research questions:

RQ1 – When and in which order are database technologies introduced in a project? We observe that they tend to be introduced very early in the project's lifetime. This is expected, since those technologies are typically central components of the projects in which they occur. We also observe that multiple database access technologies are used in many projects, and that they tend to be used simultaneously. Finally, we study which technologies tend to be complemented by other technologies.

RQ2 – How does the introduction of a new technology in a project affect the already included ones? With this question we wish to understand whether technologies tend to replace existing ones, or rather complement them. In the former case, the introduction of a new technology would decrease the use of the already included technology. In the latter case, the new technology may serve as a catalyst, leading to an increased use of the already included technology.

RQ3 – To which extent does the introduction of a new technology impact the way in which a project accesses the database? This question focuses on the evolution of project files that use a particular technology, after introducing a new database technology in the project: are these files modified in order to benefit from the newly introduced technology? For certain pairs of technologies, we found this to be the case. For most pairs of technologies however, existing database-related files do not substantially adopt the latest introduced technology.

The remainder of this paper is structured as follows. Section 2 presents attempts to methodically analyse and compare similar technologies that can be found in the scientific literature and puts our research in perspective. Section 3 presents the approach we followed for collecting the data required for our empirical study as well as the methodology for analysing it. The next three sections address our research questions. Section 7 discusses the threats to validity of our study. Section 8 discusses possible extensions of the presented study, and Section 9 concludes.

2 State of the Art

While the literature on database schema evolution is very large [1], few authors have proposed approaches to systematically observe how developers cope with database evolution in practice. Sjoberg [2] presented a study where the database schema evolution of a large-scale medical application is measured and interpreted. Vassiliadis et al. [3] studied the evolution of individual database tables over time in eight different software systems.

Several researchers have tried to identify, extract and analyse database usage in application programs. The purpose of the proposed approaches ranges from error checking [4, 5, 6], over SQL fault localisation [7], to fault diagnosis [8]. More recently, Linares-Vasquez et al. [9] studied how developers document database usage in source code. Their results show that a large proportion of database-accessing methods is completely undocumented.

Several empirical studies have analysed the evolution of library and technology usage. Bauer and Heinemann [10] were able to identify distinct evolution scenarios for API dependencies in software projects. The gained knowledge may be useful for evaluating opportunities in API migration and evolution. Teyton et al. [11] identified sets of similar libraries in a large corpus of software projects. The obtained results can be used for suggesting alternative libraries to project managers who want to migrate from one library to another. In [12] they investigate how and why library migrations occur. They found that library migrations are relatively rare, and projects that have witnessed more than one migration are exceptional. They also observed that migration is generally an atomic change performed by a single developer in a single commit.

3 Methodology and Data Extraction

The empirical study in this paper focuses on open source Java systems. Java is among the most popular programming languages today, and a large number of technologies and frameworks are available to facilitate relational database access from within Java code. The choice for open source systems is motivated by the accessibility of the entire history of the source code in freely accessible version control repositories.

3.1 Considered Database Access Technologies

In previous work [13, 14], we considered 26 Java relational database technologies that offer a direct means of accessing a relational database and whose presence in a project is identifiable through static analysis. By analysing the import statements in Java files as well as the presence of specific configuration files, we determined the presence of each of these technologies. We performed a survival analysis of the technologies used in order to determine their relative importance over time in the considered projects.

This paper provides a more in-depth study, by looking at the interaction between object-oriented source code and relational databases at a more fine-grained level. We have selected three popular technologies that are each representative of a particular way to connect the source code to a database (embedded SQL, external mapping files, and Java annotations):

JDBC

jdbc¹ is a low-level technology for connecting Java programs to a database by sending SQL queries directly from within the source code. While version 1.1 was released in 1997, there have been regular version upgrades to cope with the evolution of the Java language. This technology is still intensively used in numerous projects [13], despite the inherently close coupling that is required between the source code and the database schema.

In our study we consider this technology as being associated to a Java source code file if entities belonging to java.sql are imported in this file.

Hibernate

ORM technologies rely on a mapping description for associating (object-oriented) source code elements to database elements. They aim to reduce the so-called object-relational impedance mismatch [15]. The mapping description can take the form of configuration files, placed aside source code files, to express the relations between the considered entities. Hibernate is a popular open source Java framework adopting this solution. It was first released in 2001, and provides an abstraction layer on top of jdbc. Hibernate has been criticised by many for not being a 100% transparent data persistence solution.

In our study we analyse Hibernate² XML configuration files (denoted by hbm hereafter), and consider that a Java file relies on Hibernate technology if at least one Hibernate configuration file mentions the Java file as a code entity resource.

JPA

Annotation-based mapping descriptions offer an increasingly popular means to express the relations required by ORM engines. With such mappings, Java annotations are used to mark program elements as counterparts of database entities. The Java Persistence API³ (denoted by jpa hereafter) is the de facto Java standard for annotation-based mappings. jpa was first released in 2006, and relies on the Java annotation mechanism that was first introduced in Java 5. We consider this technology as representative for this kind of mapping description.

In our study we consider that a Java file relates to jpa if the Entity, Embeddable, or MappedSuperclass annotations from package javax.persistence can be found in this file.

Discussion

As witnessed by many discussions on Stack Overflow⁴, there is no consensus on which of these three technologies is the most appropriate for any given project, as it may depend on many project-related characteristics, technological choices or even personal preferences.

One should also note that the use of these technologies is not exclusive. A project may use all of these technologies simultaneously. These technologies may even be used together within the same Java source code files.

¹ oracle.com/technetwork/java/javase/jdbc/
² hibernate.org/
³ oracle.com/technetwork/java/javaee/tech/persistence-jsp-140049.html
⁴ see for example stackoverflow.com/questions/Q with Q = 1607819, 2397016, 2560500 or 530215.

3.2 Selected Projects

In order to obtain a representative project sample, we based our empirical analyses on Java projects belonging to the GitHub project corpus proposed by Allamanis and Sutton [16]. Among these projects, 13,307 still had an available Git repository on 24 March 2015.

In order to carry out our empirical study, we selected 2,457 projects from this project corpus for which at least one of the commits contained a reference to either jdbc, jpa or hbm. For each selected project, we extracted the existing relations between source code and database entities from the first commit of each week, and we obtained a historical view of all the files that can be related to a particular technology or to a particular framework.

                        mean    stdev   median   max.
duration (in weeks)     76      121     23       812
# commits               1317    6013    126      174,618
# contributors          12      31      4        1091
# files in HEAD         1058    3549    213      103,493
# Java files in HEAD    512     1793    88       46,661

Table 1: Characteristics of the selected projects. HEAD refers to the latest extracted version.

Table 1 shows some of the characteristics of the selected projects. The distribution of metrics values is
highly skewed, suggesting evidence of a Pareto principle [17]. The duration is expressed in weeks between the first and the last commit.

Figure 1 reports the number of projects per considered technology, taking the entire lifetime of each project into account. We observe that the project sample is relatively unbalanced with respect to the presence of each technology, but each pair of technologies is still represented in quite a number of projects.

Figure 1: Number of projects per considered technology.

4 RQ1 When and in which order are database technologies introduced in a project?

Introducing a new technology in a software project comes with a certain cost. A common policy is therefore to introduce such a technology only if the expected benefits outweigh the expected cost.

For each project, we analysed at what moment in the project's lifetime each considered technology got introduced. The answer appears to depend on the duration of the considered projects. To minimise the effect of project duration, we normalised the lifetime of each project into a range between 0 (the start of the project) and 1 (the last considered commit).

Figure 2: Violin plot (using a kernel density estimate) of the distribution of the introduction time of a technology in the Java project corpus.

Figure 2 compares, for each considered technology, two distributions of the introduction time of the technology in a project. The first distribution (left) considers the first time a technology gets introduced in a project. The second distribution (right) considers the introduction of the technology in a project that already had a technology before. As expected, we observe that more than 50% of the introductions of a first technology are done in the first 10% of the project's lifetime. For technologies introduced after an existing one, the distribution tends to be flatter.

We also observe that the two distributions for jdbc present fewer differences than the ones related to jpa or hbm. To assess this, we performed a Kolmogorov-Smirnov statistical test for each pair of distributions related to jdbc, jpa and hbm. The tests show that the two distributions associated to each technology are significantly different (p-values are lower than $10^{-6}$). This may indicate that for jdbc, the moment of introduction is less affected by the presence of another technology than for hbm and jpa.

We saw that the time at which a technology is introduced in a project varies depending on the presence of another technology in this project. What are the technologies that are more likely to be succeeded by another one?

To answer this question, we use the statistical technique of survival analysis to estimate the probability that a technology does not remain the last introduced one in a project lifetime. Survival analysis [18] creates a model estimating the survival rate of a population over time, considering the fact that some elements of the population may leave the study, and for some other elements the event of interest does not occur during the observation period. In our case, the observed event is the introduction in a project of another technology after an existing one.

Figure 3: Probability that a technology remains the last introduced technology over time.

Figure 3 shows the survival rates for each considered technology. We observe that hbm has a much lower survival rate (i.e., a lower probability of staying the last introduced technology for a long time) than the other technologies. We also observe that, during the first 10% of the projects' lifetime, the survival rates
of hbm decrease by 30%, representing a more important decrease than for the other two technologies. This implies that hbm is usually quickly replaced or complemented by another technology.

Figure 1 showed that around 23% of the projects use two or more database technologies in their lifetime, but these are not necessarily used simultaneously. We therefore identified which combinations of technologies actually co-occur in the selected Java projects. Frequent co-occurrences would reveal which technologies are complementary, and which technologies are used as supporting technologies of other ones. For each pair of technologies, we counted the number of projects in which these technologies actually co-occur, and in which order they were introduced in these projects. The results are summarised in Table 2.

(A, B) →              (jdbc, jpa)   (jdbc, hbm)   (jpa, hbm)
# projects            497           152           84
# co-occurrences      488           148           77
% co-occurrences      98.2%         97.4%         91.7%
$start_A < start_B$   157           50            19
$start_A > start_B$   151           27            37
$start_A = start_B$   189           75            28

Table 2: Project characteristics by pairs (A, B) of co-occurring technologies.

Among all projects that use multiple technologies during their lifetime we observe a very high proportion of co-occurring technologies. More specifically, in 97.3% (488+148+77 out of 497+152+84) of all the situations in which two distinct technologies were used during a project's lifetime, they were used simultaneously. Around 41% (189+75+28 out of 488+148+77) of all pairs of co-occurring technologies were introduced simultaneously ($start_A = start_B$), implying that around 59% of all pairs of co-occurring technologies concern projects in which the technologies were introduced at different moments ($start_A \neq start_B$).

Considering the number of projects in which the introduction of a technology A was observed before the use of a technology B, it seems that jpa tends to succeed hbm more often than the contrary (37 versus 19 observations). Similarly, hbm tends to succeed jdbc more often than the contrary (50 versus 27 observations). We did not identify such an order for jpa and jdbc (151 versus 157 observations).

Summary. All considered technologies are introduced early in the projects' lifetimes, even for projects that already use another technology. The number of projects in which multiple technologies co-occur is proportionally important. The order in which these technologies are introduced suggests that hbm is often succeeded by jdbc or jpa.

5 RQ2 How does the introduction of a new technology in a project affect the already included ones?

As multiple database access technologies are used in many projects, either simultaneously or one after the other, it is useful to study how the introduction of a new technology can impact the use of an already included one. This impact, if it occurs, could result in an increased or decreased usage of the already included technology. We therefore identified and counted for which projects the introduction of a new technology causes an increasing use of the older technology, a decreasing use, or no observable change in the use of the already included technology.

To qualify the impact, we rely on the first derivative of the number of files related to an existing technology. We computed and compared the mean of this derivative for two 8-week periods: the first period strictly precedes the moment of introduction of the new technology, and the second period immediately follows the moment of introduction.

In the following, we will use the term variation to denote the difference between the mean of the second period and the mean of the first period. The variation of a technology is easy to interpret: a positive value indicates an increasing use of the existing technology while a negative value indicates a decreasing use of the existing technology.

Figure 4: Impact of the introduction of a new technology on the activity of an already included technology.

Figure 4 shows the distribution of the variation for each pair of technologies. We observe that jdbc and hbm cause a slight positive impact on the use of existing technologies (since the variation tends to be positive in 75% of all cases). Notice the important variation induced by introducing hbm in projects using jpa. The converse is not true: introducing jpa in a project that already uses hbm implies a negative variation for hbm.

Figure 4 only identifies global trends in our project corpus. It does not allow us to identify trends within individual projects. Figure 5 therefore distinguishes the projects that exhibit a positive variation (blue curve), a negative variation (red curve) or no variation (green curve) for several time intervals after the introduction of the new technology.
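As an illustration of the variation used in this section (the mean weekly change in the number of files related to the existing technology over the 8 weeks after the introduction, minus the same mean over the 8 weeks before it), the following is a minimal Python sketch. It is not the tooling used in the study: the function names `weekly_derivative`, `variation` and `classify`, the exact alignment of the two windows around the introduction week, and the `tolerance` parameter are all assumptions made for the example.

```python
from statistics import mean


def weekly_derivative(file_counts):
    """Week-to-week differences of a weekly series of file counts.

    Element i of the result is file_counts[i + 1] - file_counts[i].
    """
    return [b - a for a, b in zip(file_counts, file_counts[1:])]


def variation(file_counts, intro_week, window=8):
    """Mean derivative after the introduction minus mean derivative before it.

    file_counts[i] is the number of files related to the existing technology
    in week i; intro_week is the index of the week in which the new
    technology first appears. Assumption: the "before" window covers the
    changes leading up to intro_week and the "after" window covers the
    changes starting at intro_week. Windows are shortened when the series
    does not contain a full `window` weeks on one side.
    """
    deriv = weekly_derivative(file_counts)
    before = deriv[max(0, intro_week - window):intro_week]
    after = deriv[intro_week:intro_week + window]
    if not before or not after:
        raise ValueError("not enough history around the introduction week")
    return mean(after) - mean(before)


def classify(value, tolerance=0.0):
    """Map a variation to 'increasing', 'decreasing' or 'stable'.

    `tolerance` is a hypothetical dead band around zero; the default treats
    only an exact zero as stable.
    """
    if value > tolerance:
        return "increasing"
    if value < -tolerance:
        return "decreasing"
    return "stable"
```

For a project whose weekly file counts are stored in a list `counts` and whose new technology appears in week `w`, `classify(variation(counts, w))` assigns the project to one of the three groups distinguished in Figure 5.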
Figure 5: Number of projects with an increasing, decreasing or stable activity of an already included technology, as observed x weeks after introducing another technology.

Regardless of the considered pair of technologies, with the notable exception of the pairs (jpa after hbm) and (hbm after jpa), both the number of projects having no variation and the number of projects having a positive variation are systematically greater than the number of projects exhibiting a negative variation.

Figure 6 shows survival curves, using a Kaplan-Meier estimator, of the probability that a project keeps more than a threshold of 25% of its files related to an already included technology after the introduction of a new one. We tried different threshold values and they all lead to the same conclusions.

Figure 6: Probability that at least 25% of files related to a technology remain after the introduction of another technology.

Again, we observe that the most distinct behaviours are exhibited by jpa and hbm: the probability to keep more than 25% of files related to hbm drops below 0.55 about 20 weeks after introducing jpa, while the probability for jpa files drops to a little more than 0.6 about 19 weeks after introducing hbm. This analysis corroborates our previous observations: introducing jpa or hbm does not negatively impact the use of jdbc, and conversely. We also observe from Figures 5 and 6 that most of the impact happens in the first weeks after introducing the new technology.

Summary. Introducing a new technology generally induces, in the short term, an increase of the presence of the already included technology, with the notable exception of the introduction of jpa on a project that already makes use of hbm. This suggests that, contrary to the promises of ORM technologies, new technologies do not tend to replace existing ones but rather complement them.

6 RQ3 To which extent does the introduction of a technology impact the way in which a project accesses the database?

From the results of RQ1 we observed that, if a project uses multiple database access technologies over its lifetime, these technologies tend to co-occur. At a more fine-grained level, we are interested in the impact of the introduction of a technology on the files that already relate to a previously used technology.

6.1 Do different technologies co-occur at file level?

Let us first study the co-occurrences of different technologies at file level without taking the evolutionary aspect into account. Figure 7 shows, for each pair of technologies, the distribution across projects of the ratio between the number of files that relate to each, or both, technologies, and the number of files that relate to any of these technologies. For each pair of technologies, only projects in which both technologies have been used at some point in their lifetime have been retained as elements of the distribution.

Figure 7: Relative number of files relating to pairs of technologies.

It turns out that pairs of technologies including jdbc present similar profiles: most projects contain a small proportion of files using both technologies. A two-sided Kolmogorov-Smirnov test confirms this similarity between distributions: we cannot reject the null hypothesis that states that the distributions associated to the proportion of files using a single technology are identical (p = 0.877 and 0.287, respectively). We conclude that jdbc is generally not used in the same files as jpa and hbm.

The pair of technologies jpa and hbm presents a different behaviour. The three distributions of the proportion of files that only relate to these technologies are significantly different (we reject the null hypothesis with p < 0.001). This result, combined with the form of the distributions, suggests that, for projects having used jpa and hbm, a file is likely to relate either to jpa only or to both jpa and hbm. In addition to this, the proportion of files that use both hbm and jpa is more important than for the other considered pairs of technologies.

Summary. There is a clear separation between files using jdbc and files using the two other technologies. For the combination of hbm and jpa, a partial, asymmetric overlap exists at file level: hbm is often used in the same files as jpa, while jpa is rarely used in combination with another technology in the same file.

6.2 How does the co-occurrence of technologies at file level evolve over time?

Let us now look at the same question from an evolutionary point of view, by assessing the impact, at file level, of introducing a new technology in a project that already uses another technology to access the database. To do this, we study how the files related to an existing technology get changed after introduction of the new technology.

Let us associate a migration profile to each project at different points in time after the introduction of the new technology. This migration profile reflects how the files related to the old technology are impacted. It is computed as follows:

Let $P$ be a project and $T = \{jdbc, hbm, jpa\}$ the considered technologies. For each point in time $t$ for $P$ and each technology $A \in T$ we define $related_P(A, t)$ as the (possibly empty) subset of (fully qualified) filenames of $P$ in which technology $A$ was detected at time $t$.

For every pair of distinct technologies $(A, B) \in T \times T$, we write $M = (P, A, B)$ if $P$ is a project in which technology $B$ gets introduced while a technology $A$ is already in use. Let $t_M$ denote the point in time of this introduction and $F_M = related_P(A, t_M)$ the set of filenames associated to technology $A$. For each $t \geq t_M$ we associate to each $f \in F_M$ a label in $\mathcal{L} = \{residual, removed, complemented, replaced\}$ as follows:

residual if $f \in related_P(A, t) \setminus related_P(B, t)$
removed if $f \notin related_P(A, t) \cup related_P(B, t)$
complemented if $f \in related_P(A, t) \cap related_P(B, t)$
replaced if $f \in related_P(B, t) \setminus related_P(A, t)$

Given $M$, we also associate to each $t \geq t_M$ a set of labels $mp_M(t) \subseteq \mathcal{L}$. A label $L \in \mathcal{L}$ belongs to $mp_M(t)$ if, among the labels associated to each $f \in F_M$ at time $t$, no other label occurs more frequently than $L$.

Finally, the migration profile of $M$ at time $t$ is a unique label from $mp_M(t)$ selected based on the total order replaced > complemented > removed > residual. This total order privileges migration profiles that correspond to the adoption of the new technology.

As the choice of a total order could have altered the results of our analysis, we compared the results obtained with several total orders, and we observed only slight local variations. This is not surprising as there are only 72 pairs $(M, t)$ such that $|mp_M(t)| > 1$, representing 1.78% of all the considered pairs.

Figure 8 shows the evolution of the proportion of projects with a given migration profile. For the sake of readability, we only present results for complemented, replaced, and removed. The results for residual can be deduced from these, by taking the complement of complemented, replaced and removed.

We observe that, for each considered pair of technologies, and for each time delay (expressed in weeks) after the introduction of the new technology, most projects relate to the residual migration profile, implying that projects tend not to adapt their existing database access files to make use of the newly introduced technology. This is especially true for projects introducing jdbc after jpa or hbm.

The second dominant migration profile is removed. Regardless of the considered pair of technologies, more
7 Threats to validity
Our research su↵ers from the same threats as other
research relying on Git and GitHub [19, 20].
The selected Java projects potentially su↵er from
the same generalisability constraints as in [16]. The
open source GitHub Java project corpus was curated
to exclude low-quality projects (by ignoring projects
that were never forked) and project duplicates.
While our corpus contained 2,457 projects, the
number of projects involved in some pairs of database
technologies were sometimes much lower. For example,
only 19 projects were concerned by a migration from
jpa to hbm (cf. Table 2). The accuracy of our obser-
vations could be increased by using a larger project
corpus.
The detection of a technology is based on the static
analysis of code and project-specific artefacts (e.g.,
Java annotations, import statements and XML files).
This approach can lead to false positives: the presence
of these artefacts does not necessarily reflect the actual
use of the related technology.
Some of our analyses are based on arbitrarily chosen
Figure 8: Proportion (stacked) of projects for each thresholds and on weekly time intervals. Because our
migration profile. The complement corresponds to re- results may depend on these thresholds and intervals,
placed . we repeated our experiments with di↵erent parameters
but did not observe any major di↵erences.
and more projects are associated to this migration profile.
Over time, an increasing number of projects tend
to reduce the number of files relating to the first
considered technology. The predominance of residual and
removed migration profiles seems to convey that, in
many cases, files related to the existing
technology are not prone to use the newly
introduced technology. Instead, they either continue
to use the first technology or they tend to lose any
relation to database access management.
The two other migration profiles, complemented
and replaced, indicate an effective file migration from
the existing technology to the newly introduced one.
Such cases appear to be much less represented in our
corpus, with the exception of projects in which jpa or
jdbc is introduced after hbm. This is especially the
case when jpa is introduced in a project using
hbm: the files that were related to hbm become
(sometimes exclusively) related to jpa.

Summary. Different technologies generally do
not tend to co-occur in the same set of files,
except, to some extent, when jpa and hbm are used
together. We do not observe a true migration in
technology usage: files that are related to a given
technology do not tend to adopt the newly
introduced technology, except for projects that
migrate from hbm to another technology.

8 Future Work
The results presented in this article, possibly
combined with more traditional project quality metrics,
could be integrated in a managerial dashboard. Such
a dashboard could be used to compare the
characteristics and the evolution of a particular project against
those belonging to the analysed project corpus. This
would support project managers in evaluating and
exploiting the expected benefits and disadvantages from
introducing a new technology, as well as in assessing
the impact of how this technology will become used
in the project over time. Any ensuing managerial
decisions will obviously depend on project-specific rules
and guidelines that could hardly be generalized.
This paper used static analysis techniques to
detect the presence of a particular technology. Using
dynamic analysis techniques could reveal how database
technologies are actually used in running systems. The
analysis of queries submitted to the database at
runtime could be used for understanding to what extent
ORM technologies hide complexity from developers.
This paper focused on relational database access
technologies based on three representative technologies
(jdbc, Hibernate and jpa). It could be useful to include
other Java specifications for object persistence as well,
such as JDO. It would also be useful to consider other
kinds of databases (such as NoSQL, graph or
object-oriented databases), since these are becoming
increasingly more popular. A follow-up study could take into
account such alternative database technologies.
Other technological domains (beyond databases)
could be considered as well. Event loggers,
graphical user interfaces, and unit tests are examples of
features supported by multiple concurrent
technologies. Since the identification of the technology used in
project files is the only part of our methodology that
depends on the considered technologies, our approach
could be easily adapted to study other technologies.
Section 7 mentioned the limitations of the selected
project corpus. We therefore intend to confirm our
research results by considering a larger project corpus,
including both open and closed source projects. We
also intend to study the effect of project quality and
project maturity on the obtained results. Finally, we
intend to include other programming languages than
Java in the project corpus in order to avoid any bias
introduced by language-specific characteristics.
While this paper only focused on technical aspects
of connecting source code to databases, we plan to
study the social aspects of systems involving such a
database connection. More precisely, we would like to
determine if the different technologies are introduced
and managed by different teams or persons. Inspired
by [21] we also aim to analyse the developer
characteristics in order to determine how these affect the
take-up, use, evolution and migration of technologies. Some
examples of developer characteristics are their degree
of specialisation, diversity, seniority, skills, and
workload.
Finally, we plan to analyse software systems in
order to automatically identify library features used in
the source code, as well as feature similarities between
different technologies. In situations where developers
want to migrate from a given technology to another,
such a feature identification and mapping is a first step
towards better support for assisted or automatic
migration [22].

9 Conclusions
Through static analysis of Java source code we carried
out a large-scale empirical study to understand how
database access technologies interact with one another.
We considered three popular technologies (jdbc, hbm
and jpa) that represent different means to connect Java
source code files to a relational database. We selected
data from 2,457 open source projects on GitHub that
used at least one of the considered technologies.
Our study revealed common behaviours in the use
of these three technologies. In spite of the promises
of ORM technologies, we found no evidence that the
low-level jdbc solution is massively replaced by hbm or
jpa. The only significant technology migration we
observed concerns the transition from hbm to jpa. More
specifically, we summarise our main observations
below.
We analysed the evolution and co-occurrences of
the technologies in order to get a high-level view of
their usage in the considered Java projects. It appears
that, most of the time, database technologies are
introduced early in the projects’ lifetime, whether they
are the first technology introduced or not. Once
introduced in a project, hbm tends to be complemented or
replaced by another technology more frequently and
more quickly than jpa and jdbc.
We also analysed how the technologies are used in
the source code files. The introduction of jdbc and
hbm tends to be followed by an increasing use of the
already present database technology. This increase is
particularly important when hbm is introduced after
jpa. Conversely, the introduction of jpa reduces the
use of hbm. jpa therefore appears to replace existing
hbm in the database-related source-code files, while the
converse is not true.
Furthermore, jdbc generally does not share source
code files with the two other considered database
technologies. While jpa is used in isolation in a majority
of source code files, hbm tends to be used more often
in conjunction with jpa. The study of the evolution of
such co-occurrence reveals that a file migration from a
technology to another one is only observed from hbm
to jpa. In most projects, the introduction of a new
database technology is not followed by a massive
adoption of this technology by the existing database-related
files, until these files become database-unrelated or are
removed from the source code repository.
Exploiting all these results in a dashboard that
supports managers in making project-specific decisions
with respect to the introduction, use or evolution of
database access technologies remains part of future
work.

Acknowledgment
This research was conducted as part of the FRFC
research project T.0022.13 “Data-Intensive Software
System Evolution” that was financed by the
F.R.S.-FNRS, Belgium.

References
[1] E. Rahm and P. A. Bernstein, “An online bibliography on schema evolution,” SIGMOD Rec., vol. 35, no. 4, pp. 30–31, Dec. 2006.
[2] D. Sjoberg, “Quantifying schema evolution,” Information and Software Technology, vol. 35, no. 1, pp. 35–44, 1993.
[3] P. Vassiliadis, A. V. Zarras, and I. Skoulis, “How is life for a table in an evolving relational schema? Birth, death and everything in between,” in Int’l Conf. Conceptual Modeling (ER), 2015, pp. 453–466.
[4] A. S. Christensen, A. Møller, and M. I. Schwartzbach, “Precise analysis of string expressions,” in Int’l Conf. Static Analysis (SAS), 2003, pp. 1–18.
[5] C. Gould, Z. Su, and P. Devanbu, “Static checking of dynamically generated queries in database applications,” in Int’l Conf. Software Engineering. IEEE Comp. Soc., 2004, pp. 645–654.
[6] M. Sonoda, T. Matsuda, D. Koizumi, and S. Hirasawa, “On automatic detection of SQL injection attacks by the feature extraction of the single character,” in Int’l Conf. Security of Information and Networks (SIN), 2011, pp. 81–86.
[7] S. R. Clark, J. Cobb, G. M. Kapfhammer, J. A. Jones, and M. J. Harrold, “Localizing SQL faults in database applications,” in Int’l Conf. Automated Software Engineering (ASE), 2011, pp. 213–222.
[8] M. A. Javid and S. M. Embury, “Diagnosing faults in embedded queries in database applications,” in EDBT/ICDT’12 Workshops, 2012, pp. 239–244.
[9] M. Linares-Vasquez, B. Li, C. Vendome, and D. Poshyvanyk, “How do developers document database usages in source code?” in Int’l Conf. Automated Software Engineering (ASE), 2015.
[10] V. Bauer and L. Heinemann, “Understanding API usage to support informed decision making in software maintenance,” in European Conf. Software Maintenance and Reengineering, 2012, pp. 435–440.
[11] C. Teyton, J. Falleri, and X. Blanc, “Mining library migration graphs,” in Working Conf. Reverse Engineering, 2012, pp. 289–298.
[12] C. Teyton, J. Falleri, M. Palyart, and X. Blanc, “A study of library migrations in Java,” Journal of Software: Evolution and Process, vol. 26, no. 11, pp. 1030–1052, 2014.
[13] M. Goeminne and T. Mens, “Towards a survival analysis of database framework usage in Java projects,” in Int’l Conf. Software Maintenance and Evolution, 2015.
[14] M. Goeminne, A. Decan, and T. Mens, “Co-evolving code-related and database-related changes in a data-intensive software system,” in CSMR-WCRE Software Evolution Week, 2014, pp. 353–357.
[15] M. N. C. Ireland, D. Bowers and K. Waugh, “A classification of object-relational impedance mismatch,” in Int’l Conf. Advances in Databases, Knowledge, and Data Applications (DBKDA), 2009, pp. 36–43.
[16] M. Allamanis and C. Sutton, “Mining source code repositories at massive scale using language modeling,” in Int’l Conf. Mining Software Repositories. IEEE, 2013, pp. 207–216.
[17] M. Goeminne and T. Mens, “Evidence for the Pareto principle in open source software activity,” in Workshop on Software Quality and Maintainability (SQM), ser. CEUR Workshop Proceedings, vol. 701. CEUR-WS.org, 2011, pp. 74–82.
[18] I. Samoladas, L. Angelis, and I. Stamelos, “Survival analysis on the duration of open source projects,” Information & Software Technology, vol. 52, no. 9, pp. 902–922, 2010.
[19] C. Bird, P. C. Rigby, E. T. Barr, D. J. Hamilton, D. M. Germán, and P. T. Devanbu, “The promises and perils of mining Git,” in Int’l Conf. Mining Software Repositories, 2009, pp. 1–10.
[20] E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. M. Germán, and D. Damian, “The promises and perils of mining GitHub,” in Int’l Conf. Mining Software Repositories, 2014, pp. 92–101.
[21] B. Vasilescu, A. Serebrenik, M. Goeminne, and T. Mens, “On the variation and specialisation of workload: A case study of the Gnome ecosystem community,” J. Empirical Software Engineering, pp. 1–54, 2013.
[22] C. Teyton, J.-R. Falleri, and X. Blanc, “Automatic discovery of function mappings between similar libraries,” in Working Conf. Reverse Engineering, Oct 2013, pp. 192–201.