=Paper=
{{Paper
|id=Vol-1820/paper-03
|storemode=property
|title=On the Interaction of Relational Database Access Technologies in Open Source Java Projects
|pdfUrl=https://ceur-ws.org/Vol-1820/paper-03.pdf
|volume=Vol-1820
|authors=Alexandre Decan,Mathieu Goeminne,Tom Mens
|dblpUrl=https://dblp.org/rec/conf/sattose/DecanGM15
}}
==On the Interaction of Relational Database Access Technologies in Open Source Java Projects==
<pdf width="1500px">https://ceur-ws.org/Vol-1820/paper-03.pdf</pdf>
<pre>
        On the Interaction of Relational Database Access
          Technologies in Open Source Java Projects

                       Alexandre Decan*, Mathieu Goeminne*† and Tom Mens*
                     * Software Engineering Lab, University of Mons, Belgium
                                Email: {first.last}@umons.ac.be
         † Center of Excellence in Information and Communication Technologies, Belgium
                                 Email: mathieu.goeminne@cetic.be


                                                              1   Introduction

                       Abstract                               As software systems become more and more complex,
                                                              the effort required for creating new systems and main-
                                                              taining existing ones increases over time. This ef-
    This article presents an empirical study of               fort can be reduced by embedding code in reusable
    how the use of relational database access tech-           libraries that offer services for supporting a particular
    nologies in open source Java projects evolves             aspect of the developed system. For example, for soft-
    over time. Our observations may be useful                 ware systems that strongly interact with a relational
    to project managers to make more informed                 database, numerous technologies (libraries, APIs and
    decisions on which technologies to introduce              frameworks) exist for connecting the program code to
    into an existing project and when. We se-                 the database. Understanding how database technolo-
    lected 2,457 Java projects on GitHub using                gies tend to replace or complement existing ones in
    the low-level JDBC technology and higher-                 software projects can help project managers in choos-
    level object relational mappings such as Hi-              ing the most appropriate technology, and the most ap-
    bernate XML configuration files and JPA an-               propriate moment for introducing this technology.
    notations. At a coarse-grained level, we anal-               The program code can be connected to the database
    ysed the probability of introducing such tech-            in various ways. In the simplest case, the code will
    nologies over time, as well as the likelihood             contain embedded database queries (e.g., SQL state-
    that multiple technologies co-occur within the            ments) that will be interpreted by the database man-
    same project. At a fine-grained level, we anal-           agement system. In more complex cases, especially
    ysed to what extent these different technolo-             for object-oriented programs, object-relational map-
    gies are used within the same set of project              pings (ORM) will be provided to translate program
    files. We also explored how the introduction              concepts (e.g., classes, methods and attributes) into
    of a new database technology in a Java project            database concepts (e.g., tables, columns and values),
    impacts the use of existing ones. We ob-                  so that database elements can be created, read, up-
    served that, contrary to what could have been             dated or deleted (CRUD) directly by manipulating
    expected, object-relational mapping technolo-             object-oriented views. Despite the fact that ORMs
    gies do not tend to replace existing ones but             abstract away from technical connection details in or-
    rather complement them.                                   der to facilitate software development, some evolution-
                                                              related problems remain.
Copyright © 2016 by the paper’s authors. Copying permitted       The highly dynamic nature of current database ac-
for private and academic purposes. This volume is published   cess technologies makes it hard for a programmer to
and copyrighted by its editors.                               figure out which SQL queries will be executed at a
In: A.H. Bagge, T. Mens (eds.): Postproceedings of SATToSE
2015 Seminar on Advanced Techniques and Tools for Software    given location of the program source code, or which
Evolution, University of Mons, Belgium, 6-8 July 2015,        source code methods actually access a given database
published at http://ceur-ws.org                               table or column. Conversely, the high level of ab-
straction provided by the ORMs makes it hard to de-        2     State of the Art
termine the impact on the program code of changes
                                                           While the literature on database schema evolution is
in the database schema. In addition, co-evolving the
                                                           very large [1], few authors have proposed approaches
database and the program requires mastering multiple
                                                           to systematically observe how developers cope with
languages and technologies.
                                                           database evolution in practice. Sjoberg [2] presented a
                                                           study where the database schema evolution of a large-
   This paper examines how popular technologies are        scale medical application is measured and interpreted.
used in open source Java projects for connecting the       Vassiliadis et al. [3] studied the evolution of individual
source code to a relational database. To do so, we         database tables over time in eight different software
focus on three research questions:                         systems.
                                                              Several researchers have tried to identify, extract
   RQ1 – When and in which order are database tech-        and analyse database usage in application programs.
nologies introduced in a project? We observe that they     The purpose of the proposed approaches ranges from
tend to be introduced very early in the project’s life-    error checking [4, 5, 6], over SQL fault localisa-
time. This is expected, since those technologies are       tion [7], to fault diagnosis [8]. More recently, Linares-
typically central components of the projects in which      Vasquez et al. [9] studied how developers document
they occur. We also observe that multiple database ac-     database usage in source code. Their results show
cess technologies are used in many projects, and that      that a large proportion of database-accessing methods
they tend to be used simultaneously. Finally, we study     is completely undocumented.
which technologies tend to be complemented by other           Several empirical studies have analysed the evolu-
technologies.                                              tion of library and technology usage. Bauer and Heine-
   RQ2 – How does the introduction of a new technol-       mann [10] were able to identify distinct evolution sce-
ogy in a project affect the already included ones? With    narios for API dependencies in software projects. The
this question we wish to understand whether technolo-      gained knowledge may be useful for evaluating oppor-
gies tend to replace existing ones, or rather comple-      tunities in API migration and evolution. Teyton et
ment them. In the former case, the introduction of a       al. [11] identified sets of similar libraries in a large
new technology would decrease the use of the already       corpus of software projects. The obtained results can
included technology. In the latter case, the new tech-     be used for suggesting alternative libraries to project
nology may serve as a catalyst, leading to an increased    managers who want to migrate from a library to an-
use of the already included technology.                    other one. In [12] they investigate how and why library
                                                           migrations occur. They found that library migrations
    RQ3 – To which extent does the introduction of a       are relatively rare, and projects that have witnessed
new technology impact the way in which a project ac-       more than one migration are exceptional. They also
cesses the database? This question focuses on the evo-     observed that migration is generally an atomic change
lution of project files that use a particular technol-     performed by a single developer in a single commit.
ogy, after introducing a new database technology in
the project: are these files modified in order to bene-
fit from the newly introduced technology? For certain      3     Methodology and Data Extraction
pairs of technologies, we found this to be the case. For   The empirical study in this paper focuses on open
most pairs of technologies however, existing database-     source Java systems. Java is among the most popular
related files do not substantially adopt the latest in-    programming languages today, and a large number of
troduced technology.                                       technologies and frameworks are available to facilitate
                                                           relational database access from within Java code. The
                                                           choice for open source systems is motivated by the ac-
   The remainder of this paper is structured as follows.   cessibility of the entire history of the source code in
Section 2 presents attempts to methodically analyse        freely accessible version control repositories.
and compare similar technologies that can be found in
the scientific literature and puts our research in per-
                                                           3.1    Considered Database Access Technologies
spective. Section 3 presents the approach we followed
for collecting the data required for our empirical study   In previous work [13, 14], we considered 26 Java rela-
as well as the methodology for analysing it. The next      tional database technologies that offer a direct means
three sections address our research questions. Sec-        of accessing a relational database and whose presence
tion 7 discusses the threats to validity of our study.     in a project is identifiable through static analysis. By
Section 8 discusses possible extensions of the presented   analysing the import statements in Java files as well
study, and Section 9 concludes.                            as the presence of specific configuration files, we deter-
mined the presence of each of these technologies. We        counterparts of database entities. The Java Persis-
performed a survival analysis of the technologies used      tence API³ (denoted by jpa hereafter) is the de facto
in order to determine their relative importance over        Java standard for annotation-based mappings. jpa was
time in the considered projects.                            first released in 2006, and relies on the Java annota-
   This paper provides a more in-depth study, by look-      tion mechanism that was first introduced in Java 5.
ing at the interaction between object-oriented source       We consider this technology as representative for this
code and relational databases at a more fine-grained        kind of mapping description.
level. We have selected three popular technologies that         In our study we consider that a Java file relates to
are representative of a particular way to connect the       jpa if the Entity, Embeddable, or MappedSuperclass
source code to a database (embedded SQL, external           annotations from package javax.persistence can be
mapping files, and Java annotations):                       found in this file.

JDBC                                                        Discussion

jdbc¹ is a low-level technology for connecting Java pro-    As witnessed by many discussions on Stack Overflow⁴,
grams to a database by sending SQL queries directly         there is no consensus on which of these three technolo-
from within the source code. While version 1.1 was          gies is the most appropriate for any given project, as
released in 1997, there have been regular version up-       it may depend on many project-related characteristics,
grades to cope with the evolution of the Java language.     technological choices or even personal preferences.
This technology is still intensively used in numerous          One should also note that the use of these technolo-
projects [13], despite the inherently close coupling that   gies is not exclusive. A project may use all of these
is required between the source code and the database        technologies simultaneously. These technologies may
schema.                                                     even be used together within the same Java source
   In our study we consider this technology as being        code files.
associated to a Java source code file if entities belong-
ing to java.sql are imported in this file.                  3.2   Selected Projects
                                                            In order to obtain a representative project sample, we
Hibernate                                                   based our empirical analyses on Java projects belong-
                                                            ing to the GitHub project corpus proposed by Allamanis
ORM technologies rely on a mapping description for          and Sutton [16]. Among these projects, 13,307 still
associating (object-oriented) source code elements to       had an available Git repository on 24 March 2015.
database elements. They aim to reduce the so-called             In order to carry out our empirical study, we se-
object-relational impedance mismatch [15]. The map-         lected 2,457 projects from this project corpus for which
ping description can take the form of configuration         at least one of the commits contained a reference to
files, placed aside source code files, to express the re-   either jdbc, jpa or hbm. For each selected project, we
lations between the considered entities. Hibernate is       extracted the existing relations between source code
a popular open source Java framework adopting this          and database entities from the first commit of each
solution. It was first released in 2001, and provides an    week, and we obtained a historical view of all the
abstraction layer on top of jdbc. Hibernate has been        files that can be related to a particular technology or
criticised by many for not being a 100% transparent         to a particular framework.
data persistence solution.
                                                                                     mean   stdev   median      max.
    In our study we analyse Hibernate² XML config-           duration (in weeks)       76     121       23       812
uration files (denoted by hbm hereafter), and con-           # commits              1,317   6,013      126   174,618
sider that a Java file relies on Hibernate technology        # contributors            12      31        4     1,091
if at least one Hibernate configuration file mentions        # files in HEAD        1,058   3,549      213   103,493
the Java file as a code entity resource.                     # Java files in HEAD     512   1,793       88    46,661

                                                            Table 1: Characteristics of the selected projects.
JPA                                                         HEAD refers to the latest extracted version.
Annotation-based mapping descriptions offer an in-             Table 1 shows some of the characteristics of the se-
creasingly popular means to express the relations re-       lected projects. The distribution of metric values is
quired by ORM engines. With such mappings, Java
annotations are used to mark program elements as

  ¹ oracle.com/technetwork/java/javase/jdbc/
  ² hibernate.org/
  ³ oracle.com/technetwork/java/javaee/tech/persistence-jsp-140049.html
  ⁴ see for example stackoverflow.com/questions/Q with Q = 1607819, 2397016, 2560500 or 530215.
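
Illustration (not part of the original study's tooling). The three detection
rules of Section 3.1 can be sketched in Python as follows. The regular
expressions, the function names and the simplified treatment of Hibernate
mapping files (fully qualified class names read from the name attribute of
class elements) are assumptions made for this example only.

```python
import re

# Assumption: a file relates to jdbc when it imports anything from java.sql.
JDBC_IMPORT = re.compile(r'^\s*import\s+java\.sql\.[\w.*]+\s*;', re.MULTILINE)

# Assumption: a JPA mapping annotation counts when it is written with its
# fully qualified name, or when javax.persistence is imported in the file.
JPA_IMPORT = re.compile(
    r'^\s*import\s+javax\.persistence\.(\*|Entity|Embeddable|MappedSuperclass)\s*;',
    re.MULTILINE,
)
JPA_USE = re.compile(r'@(javax\.persistence\.)?(Entity|Embeddable|MappedSuperclass)\b')

# Assumption: mapping files name the mapped class, fully qualified, in the
# name attribute of a <class> element.
HBM_CLASS = re.compile(r'<class\b[^>]*\bname\s*=\s*"([\w.$]+)"')


def uses_jdbc(java_source):
    """True if the Java source imports an entity from java.sql."""
    return JDBC_IMPORT.search(java_source) is not None


def uses_jpa(java_source):
    """True if Entity, Embeddable or MappedSuperclass from javax.persistence is used."""
    uses = list(JPA_USE.finditer(java_source))
    if not uses:
        return False
    fully_qualified = any(m.group(1) for m in uses)
    return fully_qualified or JPA_IMPORT.search(java_source) is not None


def classes_mapped_by_hbm(hbm_xml):
    """Class names mentioned by one Hibernate XML mapping file."""
    return set(HBM_CLASS.findall(hbm_xml))


def technologies(java_class, java_source, hbm_files):
    """Set of technologies (jdbc, jpa, hbm) detected for one Java file."""
    found = set()
    if uses_jdbc(java_source):
        found.add("jdbc")
    if uses_jpa(java_source):
        found.add("jpa")
    if any(java_class in classes_mapped_by_hbm(xml) for xml in hbm_files):
        found.add("hbm")
    return found
```

A file can be reported for several technologies at once, which matches the
remark above that the technologies may be used together within the same
Java source code file.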
                                                           two distributions of the introduction time of the tech-
                                                           nology in a project. The first distribution (left) con-
                                                           siders the first time a technology gets introduced in
                                                           a project. The second distribution (right) considers
                                                           the introduction of the technology in a project that
                                                           already had a technology before. As expected, we ob-
                                                           serve that more than 50% of the introductions
                                                           of a first technology are done in the first 10%
                                                           of the project’s lifetime. For technologies intro-
                                                           duced after an existing one, the distribution tends to
                                                           be flatter.
                                                              We also observe that the two distributions for jdbc
Figure 1: Number of projects per considered technol-       present fewer differences than the ones related to jpa or
ogy.                                                       hbm. To assess this, we performed a Kolmogorov-
                                                           Smirnov statistical test for each pair of distributions
highly skewed, suggesting evidence of a Pareto princi-     related to jdbc, jpa and hbm. The tests show that the
ple [17]. The duration is expressed in weeks between       two distributions associated to each technology
the first and the last commit.                             are significantly different (p-values are lower than
   Figure 1 reports the number of projects per con-        10^-6). This may indicate that for jdbc, the moment
sidered technology, taking the entire lifetime of each     of introduction is less affected by the presence
project into account. We observe that the project sam-     of another technology than for hbm and jpa.
ple is relatively unbalanced with respect to the pres-        We saw that the time at which a technology is in-
ence of each technology, but each pair of technologies     troduced in a project varies depending on the presence
is still represented in quite a number of projects.        of another technology in this project. What are the
                                                           technologies that are more likely to be succeeded by
4    RQ1 When and in which order are                       another one?
     database technologies introduced in                      To answer this question, we use the statistical tech-
                                                           nique of survival analysis to estimate the probability
     a project?                                            that a technology does not remain the last introduced
Introducing a new technology in a software project         one in a project lifetime. Survival analysis [18] creates
comes with a certain cost. A common policy is there-       a model estimating the survival rate of a population
fore to introduce such a technology only if the expected   over time, considering the fact that some elements of
benefits outweigh the expected cost.                       the population may leave the study, and for some other
   For each project, we analysed at what moment in         elements the event of interest does not occur during the
the projects’ lifetime each considered technology got      observation period. In our case, the observed event is
introduced. The answer appears to depend on the du-        the introduction in a project of another technology af-
ration of the considered projects. To minimise the         ter an existing one.
effect of project duration, we normalised the lifetime
of each project into a range between 0 (the start of the
project) and 1 (the last considered commit).
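
Illustration (not part of the original study's tooling). The normalisation of
a project's lifetime and the survival analysis described above can be sketched
as follows. The Kaplan-Meier estimator is used here as one common choice (the
text names it explicitly only for Figure 6), and the function names and the
input format are assumptions. An observation is a pair (time, event_observed):
the event is the introduction of another technology after an existing one, and
projects in which no such introduction is seen are censored observations.

```python
from datetime import date


def normalised_time(first_commit, last_commit, moment):
    """Map a date to [0, 1]: 0 is the first commit, 1 the last considered commit."""
    span = (last_commit - first_commit).days
    if span <= 0:  # single-day project: everything happens at the start
        return 0.0
    return (moment - first_commit).days / span


def kaplan_meier(observations):
    """Kaplan-Meier survival estimate.

    observations: iterable of (time, event_observed) pairs, where
    event_observed is False for censored observations.
    Returns a list of (time, survival_probability), one entry per time
    at which at least one event was observed.
    """
    observations = sorted(observations)
    at_risk = len(observations)
    survival = 1.0
    curve = []
    i = 0
    while i < len(observations):
        t = observations[i][0]
        events = 0
        leaving = 0
        # Group all observations that share the same time.
        while i < len(observations) and observations[i][0] == t:
            if observations[i][1]:
                events += 1
            leaving += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        # Both events and censored observations leave the risk set.
        at_risk -= leaving
    return curve
```

Censored projects still count in the risk set up to the moment they leave the
observation, which is why they cannot simply be dropped from the sample.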


                                                           Figure 3: Probability that a technology remains the
                                                           last introduced technology over time.
                                                              Figure 3 shows the survival rates for each consid-
Figure 2: Violin plot (using a kernel density estimate)    ered technology. We observe that hbm has a much
of the distribution of the introduction time of a tech-    lower survival rate (i.e., a lower probability of staying
nology in the Java project corpus.                         the last introduced technology for a long time) than
                                                           the other technologies. We also observe that, during
    Figure 2 compares, for each considered technology,     the first 10% of the projects’ lifetime, the survival rates
of hbm decrease by 30%, a larger decrease than                 5   RQ2 How does the introduction of
for the other two technologies. This                               a new technology in a project affect
implies that hbm is usually quickly replaced or                    the already included ones?
complemented by another technology.
   Figure 1 showed that around 23% of the projects             As multiple database access technologies are used in
use two or more database technologies in their lifetime,       many projects, either simultaneously or one after the
but these are not necessarily used simultaneously. We          other, it is useful to study how the introduction of a
therefore identified which combinations of technologies        new technology can impact the use of an already in-
actually co-occur in the selected Java projects. Fre-          cluded one. This impact, if it occurs, could result in an
quent co-occurrences would reveal which technologies           increased or decreased usage of the already included
are complementary, and which technologies are used             technology. We therefore identified and counted for
as supporting technologies of other ones. For each             which projects the introduction of a new technology
pair of technologies, we counted the number of projects        causes an increasing use of the older technology, a de-
in which these technologies actually co-occur, and in          creasing use, or no observable change in the use of the
which order they were introduced in these projects.            already included technology.
The results are summarised in Table 2.                            To qualify the impact, we rely on the first derivative
         (A, B) →   (jdbc, jpa)   (jdbc, hbm)    (jpa, hbm)    of the number of files related to an existing technology.
       # projects           497           152            84    We computed and compared the mean of this deriva-
 # co-occurrences           488           148            77    tive for two 8-week periods: the first period strictly
 % co-occurrences         98.2%         97.4%         91.7%    precedes the moment of introduction of the new tech-
  startA < startB           157            50            19    nology, and the second period immediately follows the
  startA > startB           151            27            37    moment of introduction.
  startA = startB           189            75            28       In the following, we will use the term variation to
Table 2: Project characteristics by pairs (A, B) of co-        denote the difference between the mean of the second
occurring technologies                                         period and the mean of the first period. The variation
                                                               of a technology is easy to interpret: a positive value
   Among all projects that use multiple tech-                  indicates an increasing use of the existing technology,
nologies during their lifetime, we observe a                   while a negative value indicates a decreasing use of the
very high proportion of co-occurring technolo-                 existing technology.
gies. More specifically, in 97.3% (488+148+77 out of
497+152+84) of all the situations in which two dis-
tinct technologies were used during a project’s life-
time, they were used simultaneously. Around 41%
(189+75+28 out of 488+148+77) of all pairs of co-
occurring technologies were introduced simultaneously
(startA = startB ), implying that around 59% of all
pairs of co-occurring technologies concern projects in
which the technologies were introduced at different
moments (startA ≠ startB).
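
Illustration (added for this edition). The percentages quoted in this
paragraph can be recomputed from the counts of Table 2 with a few lines of
Python. The dictionary layout and the variable names are assumptions made for
the example; the numbers are those of the table.

```python
# Counts copied from Table 2, one entry per pair (A, B).
table2 = {
    ("jdbc", "jpa"): {"projects": 497, "co_occurrences": 488, "same_start": 189},
    ("jdbc", "hbm"): {"projects": 152, "co_occurrences": 148, "same_start": 75},
    ("jpa", "hbm"): {"projects": 84, "co_occurrences": 77, "same_start": 28},
}

total_projects = sum(row["projects"] for row in table2.values())
total_co_occurrences = sum(row["co_occurrences"] for row in table2.values())
total_same_start = sum(row["same_start"] for row in table2.values())

# Share of situations in which both technologies are used simultaneously.
share_co_occurring = total_co_occurrences / total_projects

# Share of co-occurring pairs introduced at the same moment, computed
# with the same denominator as in the text (the number of co-occurrences).
share_same_start = total_same_start / total_co_occurrences

# Per-pair percentage of co-occurrences (third row of Table 2).
per_pair = {
    pair: round(100 * row["co_occurrences"] / row["projects"], 1)
    for pair, row in table2.items()
}

print(f"{100 * share_co_occurring:.1f}% co-occurring")
print(f"{100 * share_same_start:.0f}% introduced simultaneously")
```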
   Considering the number of projects in which the             Figure 4: Impact of the introduction of a new technol-
introduction of a technology A was observed before             ogy on the activity of an already included technology.
the use of a technology B, it seems that jpa tends to
succeed hbm more often than the converse                          Figure 4 shows the distribution of the variation for
(37 versus 19 observations). Similarly, hbm tends to           each pair of technologies. We observe that jdbc and
succeed jdbc more often than the converse                      hbm cause a slight positive impact on the use of
(50 versus 27 observations). We did not identify such          existing technologies (since the variation tends to
an order for jpa and jdbc (151 versus 157 observations).       be positive in 75% of all cases). Notice the large
                                                               variation induced by introducing hbm in projects using
  Summary. All considered technologies are in-
                                                               jpa. The converse is not true: introducing jpa in a
  troduced early in the projects’ lifetimes, even for
                                                               project that already uses hbm implies a negative
  projects that already use another technology. The
                                                               variation for hbm.
  number of projects in which multiple technologies
  co-occur is proportionally large. The order                     Figure 4 only identifies global trends in our project
  in which these technologies are introduced sug-              corpus. It does not allow us to identify trends within
  gests that hbm is often succeeded by jdbc or jpa.            individual projects. Figure 5 therefore distinguishes
                                                               the projects that exhibit a positive variation (blue
curve), a negative variation (red curve) or no varia-
tion (green curve) for several time intervals after the
introduction of the new technology.
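
Illustration (not part of the original study's tooling). The variation
defined in Section 5 can be sketched as follows, assuming one snapshot per
week and approximating the first derivative by the week-to-week difference in
the number of files. The function names, the exact window boundaries (eight
weeks ending just before the introduction week, eight weeks starting at it)
and the labels returned by trend are assumptions made for the example.

```python
def weekly_deltas(file_counts):
    """deltas[w] = file_counts[w] - file_counts[w - 1]; week 0 has no predecessor."""
    return [None] + [b - a for a, b in zip(file_counts, file_counts[1:])]


def mean(values):
    return sum(values) / len(values)


def variation(file_counts, intro_week, window=8):
    """Mean weekly change after the introduction minus mean weekly change before it.

    file_counts[w] is the number of files related to the existing technology
    in week w; intro_week is the week in which the new technology appears.
    Returns None when either period contains no usable week.
    """
    deltas = weekly_deltas(file_counts)
    before = deltas[max(1, intro_week - window):intro_week]
    after = deltas[max(1, intro_week):intro_week + window]
    if not before or not after:
        return None
    return mean(after) - mean(before)


def trend(value):
    """Label a variation as in Figure 5: increasing, decreasing or stable."""
    if value is None:
        return "unknown"
    if value > 0:
        return "increasing"
    if value < 0:
        return "decreasing"
    return "stable"
```

For example, a project whose file count is flat for the eight weeks before the
introduction and then grows by two files per week has a variation of 2 and is
labelled "increasing".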


                                                             Figure 6: Probability that at least 25% of files related
                                                             to a technology remain after the introduction of an-
                                                             other technology.


                                                                 Summary. Introducing a new technology gen-
                                                                 erally induces, in the short term, an increase of
Figure 5: Number of projects with an increasing, de-             the presence of the already included technology,
creasing or stable activity of an already included tech-         with the notable exception of the introduction of
nology, as observed x weeks after introducing another            jpa on a project that already makes use of hbm.
technology.                                                      This suggests that, contrary to the promises of
                                                                 ORM technologies, new technologies do not tend
   Regardless of the considered pair of technologies,            to replace existing ones but rather complement
with the notable exception of the pairs (jpa after hbm)          them.
and (hbm after jpa), both the number of projects hav-
ing no variation and the number of projects having a         6     RQ3 To which extent does the introduction
positive variation are systematically greater than the             of a technology impact the way in which
number of projects exhibiting a negative variation.                a project accesses the database?
   Figure 6 shows survival curves, using a Kaplan-
Meier estimator, of the probability that a project keeps     From the results of RQ1 we observed that, if a project
more than a threshold of 25% of its files related to an      uses multiple database access technologies over its life-
already included technology after the introduction of        time, these technologies tend to co-occur. At a more
a new one. We tried different threshold values and they      fine-grained level, we are interested in the impact of
all lead to the same conclusions.                            the introduction of a technology on the files that al-
   Again, we observe that the most distinct be-              ready relate to a previously used technology.
haviours are exhibited by jpa and hbm: the prob-
ability to keep more than 25% of files related to hbm
drops below 0.55 about 20 weeks after introducing jpa,
while the probability for jpa files drops to a little more   6.1    Do different technologies co-occur at file
than 0.6 about 19 weeks after introducing hbm. This                 level?
analysis corroborates our previous observations: in-         Let us first study the co-occurrences of different tech-
troducing jpa or hbm does not negatively im-                 nologies at file level without taking the evolutionary
pact the use of jdbc, and conversely. We also ob-            aspect into account. Figure 7 shows, for each pair of
serve from Figures 5 and 6 that most of the impact           technologies, the distribution across projects of the ra-
happens in the first weeks after introducing the             tio between the number of files that relate to each, or
new technology.                                              both, technologies, and the number of files that relate
to any of these technologies. For each pair of technologies, only projects in which both
technologies have been used at some point in their lifetime have been retained as elements of
the distribution.

Figure 7: Relative number of files relating to pairs of technologies.

   It turns out that pairs of technologies including jdbc present similar profiles: most
projects contain a small proportion of files using both technologies. A two-sided
Kolmogorov-Smirnov test confirms this similarity between distributions: we cannot reject the
null hypothesis that states that the distributions associated to the proportion of files using
a single technology are identical (p = 0.877 and 0.287, respectively). We conclude that jdbc is
generally not used in the same files as jpa and hbm.
   The pair of technologies jpa and hbm presents a different behaviour. The three distributions
of the proportion of files that only relate to these technologies are significantly different
(we reject the null hypothesis with p < 0.001). This result, combined with the form of the
distributions, suggests that, for projects having used jpa and hbm, a file is likely to relate
either to jpa only or to both jpa and hbm. In addition to this, the proportion of files that use
both hbm and jpa is more important than for the other considered pairs of technologies.

  Summary. There is a clear separation between files using jdbc and files using the two other
  technologies. For the combination of hbm and jpa, a partial, asymmetric overlap exists at file
  level: hbm is often used in the same files as jpa, while jpa is rarely used in combination
  with another technology in the same file.

6.2   How does the co-occurrence of technologies at file level evolve over time?
Let us now look at the same question from an evolutionary point of view, by assessing the
impact, at file-level, of introducing a new technology in a project that already uses another
technology to access the database. To do this, we study how the files related to an existing
technology get changed after introduction of the new technology.
   Let us associate a migration profile to each project at different points in time after the
introduction of the new technology. This migration profile reflects how the files related to
the old technology are impacted. It is computed as follows:
   Let $P$ be a project and $T = \{jdbc, hbm, jpa\}$ the considered technologies. For each point
in time $t$ for $P$ and each technology $A \in T$ we define $\mathit{related}_P(A, t)$ as the
(possibly empty) subset of (fully qualified) filenames of $P$ in which technology $A$ was
detected at time $t$.
   For every pair of distinct technologies $(A, B) \in T \times T$, we write $M = (P, A, B)$ if
$P$ is a project in which technology $B$ gets introduced while a technology $A$ is already in
use. Let $t_M$ denote the point in time of this introduction and
$F_M = \mathit{related}_P(A, t_M)$ the set of filenames associated to technology $A$. For each
$t \geq t_M$ we associate to each $f \in F_M$ a label in
$\mathcal{L} = \{residual, removed, complemented, replaced\}$ as follows:
      residual      if $f \in \mathit{related}_P(A, t) \setminus \mathit{related}_P(B, t)$
      removed       if $f \notin \mathit{related}_P(A, t) \cup \mathit{related}_P(B, t)$
      complemented  if $f \in \mathit{related}_P(A, t) \cap \mathit{related}_P(B, t)$
      replaced      if $f \in \mathit{related}_P(B, t) \setminus \mathit{related}_P(A, t)$
   Given $M$, we also associate to each $t \geq t_M$ a set of labels
$mp_M(t) \subseteq \mathcal{L}$. A label $L \in \mathcal{L}$ belongs to $mp_M(t)$ if, among the
labels associated to each $f \in F_M$ at time $t$, no other label occurs more frequently than
$L$.
   Finally, the migration profile of $M$ at time $t$ is a unique label from $mp_M(t)$ selected
based on the total order replaced > complemented > removed > residual. This total order
privileges migration profiles that correspond to the adoption of the new technology.
   As the choice of a total order could have altered the results of our analysis, we compared
the results obtained with several total orders, and we observed only slight local variations.
This is not surprising as there are only 72 pairs $(M, t)$ such that $|mp_M(t)| > 1$,
representing 1.78% of all the considered pairs.
   Figure 8 shows the evolution of the proportion of projects with a given migration profile.
For the sake of readability, we only present results for complemented, replaced, and removed.
The results for residual can be deduced from these, by taking the complement of complemented,
replaced and removed.
   We observe that, for each considered pair of technologies, and for each time delay (expressed
in weeks) after the introduction of the new technology, most projects relate to the residual
migration profile, implying that projects tend not to adapt their existing database access files
to make use of the newly introduced technology. This is especially true for projects introducing
jdbc after jpa or hbm.
   The second dominant migration profile is removed. Regardless of the considered pair of
technologies, more
and more projects are associated to this migration profile. Over time, an increasing number of
projects tend to reduce the number of files relating to the first considered technology. The
predominance of residual and removed migration profiles seems to convey that, in many cases,
files that related to the existing technology are not prone to use the newly introduced
technology. Instead, they either continue to use the first technology or they tend to lose any
relation to database access management.

Figure 8: Proportion (stacked) of projects for each migration profile. The complement
corresponds to replaced.

   The two other migration profiles, complemented and replaced, indicate an effective file
migration from the existing technology to the newly introduced one. Such cases appear to be much
less represented in our corpus, with the exception of projects in which jpa or jdbc is
introduced after hbm. This is especially the case when jpa is introduced in a project using hbm:
the files that were related to hbm become (sometimes exclusively) related to jpa.

  Summary. Different technologies generally do not tend to co-occur in the same set of files,
  except, to some extent, when jpa and hbm are used together. We do not observe a true migration
  in technology usage: files that are related to a given technology do not tend to adopt the
  newly introduced technology, except for projects that migrate from hbm to another technology.

7    Threats to validity
Our research suffers from the same threats as other research relying on Git and GitHub [19, 20].
   The selected Java projects potentially suffer from the same generalisability constraints as
in [16]. The open source GitHub Java project corpus was curated to exclude low-quality projects
(by ignoring projects that were never forked) and project duplicates.
   While our corpus contained 2,457 projects, the number of projects involved in some pairs of
database technologies was sometimes much lower. For example, only 19 projects were concerned by
a migration from jpa to hbm (cf. Table 2). The accuracy of our observations could be increased
by using a larger project corpus.
   The detection of a technology is based on the static analysis of code and project-specific
artefacts (e.g., Java annotations, import statements and XML files). This approach can lead to
false positives: the presence of these artefacts does not necessarily reflect the actual use of
the related technology.
   Some of our analyses are based on arbitrarily chosen thresholds and on weekly time intervals.
Because our results may depend on these thresholds and intervals, we repeated our experiments
with different parameters but did not observe any major differences.

8    Future Work
The results presented in this article, possibly combined with more traditional project quality
metrics, could be integrated in a managerial dashboard. Such a dashboard could be used to
compare the characteristics and the evolution of a particular project against those belonging to
the analysed project corpus. This would support project managers in evaluating and exploiting
the expected benefits and disadvantages from introducing a new technology, as well as in
assessing the impact of how this technology will become used in the project over time. Any
ensuing managerial decisions will obviously depend on project-specific rules and guidelines that
could hardly be generalized.
   This paper used static analysis techniques to detect the presence of a particular technology.
Using dynamic analysis techniques could reveal how database technologies are actually used in
running systems. The analysis of queries submitted to the database at runtime could be used for
understanding to what extent ORM technologies hide complexity from developers.
   This paper focused on relational database access technologies based on three representative
technologies (jdbc, Hibernate and jpa). It could be useful to include other Java specifications
for object persistence as well, such as JDO. It would also be useful to consider other
kinds of databases (such as NoSQL, graph or object-oriented databases), since these are becoming
increasingly more popular. A follow-up study could take into account such alternative database
technologies.
   Other technological domains (beyond databases) could be considered as well. Event loggers,
graphical user interfaces, and unit tests are examples of features supported by multiple
concurrent technologies. Since the identification of the technology used in project files is the
only part of our methodology that depends on the considered technologies, our approach could be
easily adapted to study other technologies.
   Section 7 mentioned the limitations of the selected project corpus. We therefore intend to
confirm our research results by considering a larger project corpus, including both open and
closed source projects. We also intend to study the effect of project quality and project
maturity on the obtained results. Finally, we intend to include programming languages other than
Java in the project corpus in order to avoid any bias introduced by language-specific
characteristics.
   While this paper only focused on technical aspects of connecting source code to databases, we
plan to study the social aspects of systems involving such a database connection. More
precisely, we would like to determine if the different technologies are introduced and managed
by different teams or persons. Inspired by [21] we also aim to analyse the developer
characteristics in order to determine how these affect the take-up, use, evolution and migration
of technologies. Some examples of developer characteristics are their degree of specialisation,
diversity, seniority, skills, and workload.
   Finally, we plan to analyse software systems in order to automatically identify library
features used in the source code, as well as feature similarities between different
technologies. In situations where developers want to migrate from a given technology to another,
such a feature identification and mapping is a first step towards better support for assisted or
automatic migration [22].

9    Conclusions
Through static analysis of Java source code we carried out a large-scale empirical study to
understand how database access technologies interact with one another. We considered three
popular technologies (jdbc, hbm and jpa) that represent different means to connect Java source
code files to a relational database. We selected data from 2,457 open source projects on GitHub
that used at least one of the considered technologies.
   Our study revealed common behaviours in the use of these three technologies. In spite of the
promises of ORM technologies, we found no evidence that the low-level jdbc solution is massively
replaced by hbm or jpa. The only significant technology migration we observed concerns the
transition from hbm to jpa. More specifically, we summarise our main observations below.
   We analysed the evolution and co-occurrences of the technologies in order to get a high-level
view of their usage in the considered Java projects. It appears that, most of the time, database
technologies are introduced early in the projects’ lifetime, whether they are the first
technology introduced or not. Once introduced in a project, hbm tends to be complemented or
replaced by another technology more frequently and more quickly than jpa and jdbc.
   We also analysed how the technologies are used in the source code files. The introduction of
jdbc and hbm tends to be followed by an increasing use of the already present database
technology. This increase is particularly important when hbm is introduced after jpa.
Conversely, the introduction of jpa reduces the use of hbm. jpa therefore appears to replace
existing hbm in the database-related source-code files, while the converse is not true.
   Furthermore, jdbc generally does not share source code files with the two other considered
database technologies. While jpa is used in isolation in a majority of source code files, hbm
tends to be used more often in conjunction with jpa. The study of the evolution of such
co-occurrence reveals that a file migration from a technology to another one is only observed
from hbm to jpa. In most projects, the introduction of a new database technology is not followed
by a massive adoption of this technology by the existing database-related files, until these
files become database-unrelated or are removed from the source code repository.
   Exploiting all these results in a dashboard that supports managers in making project-specific
decisions with respect to the introduction, use or evolution of database access technologies
remains part of future work.

Acknowledgment
This research was conducted as part of the FRFC research project T.0022.13 “Data-Intensive
Software System Evolution” that was financed by the F.R.S.-FNRS, Belgium.
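
As an illustration of the migration-profile definition given in Section 6.2, the labelling and
tie-breaking steps could be sketched in Python as follows. This is a minimal sketch and not the
tooling used in the study: the function names label_file and migration_profile, the
representation of related_P(A, t) and related_P(B, t) as Python sets of filenames, the error
raised for an empty F_M, and the example filenames are all assumptions made for illustration.

```python
from collections import Counter

# Total order stated in Section 6.2: replaced > complemented > removed > residual.
PRIORITY = ["replaced", "complemented", "removed", "residual"]


def label_file(f, related_a, related_b):
    """Label one file f of F_M, given related_P(A, t) and related_P(B, t) as sets."""
    in_a, in_b = f in related_a, f in related_b
    if in_a and in_b:
        return "complemented"  # still uses A and now also uses B
    if in_a:
        return "residual"      # still uses A only
    if in_b:
        return "replaced"      # uses B only
    return "removed"           # uses neither technology any more


def migration_profile(f_m, related_a, related_b):
    """Return (mp_M(t), migration profile) for the files f_m related to A at t_M."""
    counts = Counter(label_file(f, related_a, related_b) for f in f_m)
    if not counts:
        # Assumption of this sketch: F_M is expected to be non-empty.
        raise ValueError("F_M must not be empty")
    top = max(counts.values())
    # mp_M(t): every label that no other label beats in frequency.
    most_frequent = {label for label, n in counts.items() if n == top}
    # Ties are broken with the total order, favouring adoption of the new technology.
    profile = next(label for label in PRIORITY if label in most_frequent)
    return most_frequent, profile


# Hypothetical project: four files related to hbm (A) when jpa (B) was introduced.
f_m = {"a/Dao.java", "a/User.java", "a/Order.java", "a/Old.java"}
related_a = {"a/Dao.java", "a/User.java"}                   # hbm files at time t
related_b = {"a/User.java", "a/Order.java", "b/New.java"}   # jpa files at time t
labels, profile = migration_profile(f_m, related_a, related_b)
print(sorted(labels), profile)
```

In this made-up example each of the four labels occurs exactly once, so mp_M(t) contains all
four labels and the total order selects replaced. Files such as b/New.java, which were not
related to the old technology at t_M, are ignored because they do not belong to F_M.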
References

 [1] E. Rahm and P. A. Bernstein, “An online bibliography on schema evolution,” SIGMOD Rec.,
     vol. 35, no. 4, pp. 30–31, Dec. 2006.
 [2] D. Sjoberg, “Quantifying schema evolution,” Information and Software Technology, vol. 35,
     no. 1, pp. 35–44, 1993.
 [3] P. Vassiliadis, A. V. Zarras, and I. Skoulis, “How is life for a table in an evolving
     relational schema? Birth, death and everything in between,” in Int’l Conf. Conceptual
     Modeling (ER), 2015, pp. 453–466.
 [4] A. S. Christensen, A. Møller, and M. I. Schwartzbach, “Precise analysis of string
     expressions,” in Int’l Conf. Static Analysis (SAS), 2003, pp. 1–18.
 [5] C. Gould, Z. Su, and P. Devanbu, “Static checking of dynamically generated queries in
     database applications,” in Int’l Conf. Software Engineering. IEEE Comp. Soc., 2004,
     pp. 645–654.
 [6] M. Sonoda, T. Matsuda, D. Koizumi, and S. Hirasawa, “On automatic detection of SQL
     injection attacks by the feature extraction of the single character,” in Int’l Conf.
     Security of Information and Networks (SIN), 2011, pp. 81–86.
 [7] S. R. Clark, J. Cobb, G. M. Kapfhammer, J. A. Jones, and M. J. Harrold, “Localizing SQL
     faults in database applications,” in Int’l Conf. Automated Software Engineering (ASE),
     2011, pp. 213–222.
 [8] M. A. Javid and S. M. Embury, “Diagnosing faults in embedded queries in database
     applications,” in EDBT/ICDT’12 Workshops, 2012, pp. 239–244.
 [9] M. Linares-Vasquez, B. Li, C. Vendome, and D. Poshyvanyk, “How do developers document
     database usages in source code?” in Int’l Conf. Automated Software Engineering (ASE), 2015.
[10] V. Bauer and L. Heinemann, “Understanding API usage to support informed decision making in
     software maintenance,” in European Conf. Software Maintenance and Reengineering, 2012,
     pp. 435–440.
[11] C. Teyton, J. Falleri, and X. Blanc, “Mining library migration graphs,” in Working Conf.
     Reverse Engineering, 2012, pp. 289–298.
[12] C. Teyton, J. Falleri, M. Palyart, and X. Blanc, “A study of library migrations in Java,”
     Journal of Software: Evolution and Process, vol. 26, no. 11, pp. 1030–1052, 2014.
[13] M. Goeminne and T. Mens, “Towards a survival analysis of database framework usage in Java
     projects,” in Int’l Conf. Software Maintenance and Evolution, 2015.
[14] M. Goeminne, A. Decan, and T. Mens, “Co-evolving code-related and database-related changes
     in a data-intensive software system,” in CSMR-WCRE Software Evolution Week, 2014,
     pp. 353–357.
[15] M. N. C. Ireland, D. Bowers and K. Waugh, “A classification of object-relational impedance
     mismatch,” in Int’l Conf. Advances in Databases, Knowledge, and Data Applications (DBKDA),
     2009, pp. 36–43.
[16] M. Allamanis and C. Sutton, “Mining source code repositories at massive scale using
     language modeling,” in Int’l Conf. Mining Software Repositories. IEEE, 2013, pp. 207–216.
[17] M. Goeminne and T. Mens, “Evidence for the Pareto principle in open source software
     activity,” in Workshop on Software Quality and Maintainability (SQM), ser. CEUR Workshop
     Proceedings, vol. 701. CEUR-WS.org, 2011, pp. 74–82.
[18] I. Samoladas, L. Angelis, and I. Stamelos, “Survival analysis on the duration of open
     source projects,” Information & Software Technology, vol. 52, no. 9, pp. 902–922, 2010.
[19] C. Bird, P. C. Rigby, E. T. Barr, D. J. Hamilton, D. M. Germán, and P. T. Devanbu, “The
     promises and perils of mining Git,” in Int’l Conf. Mining Software Repositories, 2009,
     pp. 1–10.
[20] E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. M. Germán, and D. Damian, “The
     promises and perils of mining GitHub,” in Int’l Conf. Mining Software Repositories, 2014,
     pp. 92–101.
[21] B. Vasilescu, A. Serebrenik, M. Goeminne, and T. Mens, “On the variation and specialisation
     of workload: A case study of the Gnome ecosystem community,” J. Empirical Software
     Engineering, pp. 1–54, 2013.
[22] C. Teyton, J.-R. Falleri, and X. Blanc, “Automatic discovery of function mappings between
     similar libraries,” in Working Conf. Reverse Engineering, Oct 2013, pp. 192–201.

</pre>