<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Humpback: Code Completion System for Dockerfiles Based on Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kaisei Hanayama</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shinsuke Matsumoto</string-name>
          <email>shinsuke@ist.osaka-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shinji Kusumoto</string-name>
          <email>kusumoto@ist.osaka-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Graduate School of Information Science and Technology, Osaka University</institution>
          ,
          <addr-line>Suita, Osaka, 565-0871</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This study focuses on Docker, the de facto standard containerization platform. Containers in Docker are built from configuration files called Dockerfiles. Managing infrastructure as code makes it possible to incorporate knowledge gained from conventional software development. However, infrastructure as code is a relatively new technology, and some of its domains have not been fully researched. In this study, we focus on code completion and aim to construct a system that supports the development of Dockerfiles. The proposed code completion system, Humpback, trains long short-term memory language models on a pre-collected dataset of Dockerfiles and uses model switching to overcome a Docker-specific code completion problem. Evaluation experiments show that Humpback achieves a high average accuracy of 96.9%.</p>
      </abstract>
      <kwd-group>
        <kwd>Docker</kwd>
        <kwd>code completion</kwd>
        <kwd>machine learning</kwd>
        <kwd>language model</kwd>
        <kwd>long short-term memory</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Server virtualization is broadly used for cost reduction and efficient resource utilization. Among
various methods of virtualization, containerization has become mainstream [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Containerization
creates logical compartments (i.e., containers) on the host operating system. Each container
provides an independent environment.
      </p>
      <p>
        Docker1 is the de facto standard containerization platform [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Containers in Docker are
configured by writing imperative instructions in files called Dockerfiles. The process of managing
infrastructure configuration through machine-readable definition files is called infrastructure
as code (IaC). IaC enables developers to manage infrastructure configuration in the same way
as application code, allowing automated scaling and the prevention of human error [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
However, IaC is a relatively new technology, and thus some areas are still maturing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], such as
development support and static analysis.
      </p>
      <p>
        In this study, we focus on code completion, a widely used feature in software development [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
We believe that providing a code completion system for an emerging technology such as Docker
can considerably improve productivity by reusing existing knowledge and reducing common
errors.
      </p>
      <p>
        One concern when building a Docker-specific code completion system is base image
differences. A base image, which includes a Linux distribution, is an image file on which containers
are created. Dockerfiles contain a nested language: embedded scripting languages (mainly
bash) are written inside the top-level syntax [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The contents of Dockerfiles
differ considerably depending on the base image. For example, for a base image that includes
Ubuntu, the apt-get command is used in the RUN instruction, whereas for a CentOS base
image, the dnf command is used. For accurate code completion, base image differences must
thus be taken into account.
      </p>
      <p>
        The contributions of this paper are as follows:
1. A solution to the Docker-specific challenge is presented. We introduce model
switching to overcome the problem caused by base image differences. With model switching, language
models for predictions are switched depending on the base image. Long short-term memory
(LSTM) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is employed to generate language models (section 3.2).
      </p>
      <p>2. A novel Docker-specific code completion system, Humpback, is implemented.
Figure 1 shows a screenshot of Humpback. Humpback is available online and can be used in a
web browser.2 Evaluation experiments show that Humpback has a high accuracy of 96.9% and
is useful for developing Dockerfiles (section 4.4).</p>
      <p>2https://sdl.ist.osaka-u.ac.jp/~k-hanaym/humpback/</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>2.1. Code completion</title>
        <p>
          Code completion is extensively used in software development [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. A pop-up dialog is used to
display candidate words after the user has typed some characters. Developers select the desired
word from the list, reducing typos and other common errors. Another benefit is the facilitation
of the use of descriptive (i.e., long) names for variables. Manually entering long variable names
is cumbersome and error-prone.
        </p>
        <p>
          Traditional code completion systems display all candidates, which can result in an extremely long list.
Many intelligent code completion systems have been proposed to overcome this
problem [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Systems that use statistical language models such as N-gram and recurrent neural
network (RNN)-based approaches have achieved high performance. Given a token sequence
w_1, ..., w_n of length n, a language model gives the probability P(w_1, ..., w_n). This probability indicates
the relative likelihood of word sequences, which allows the construction of code completion systems.
Intelligent code completion considers the context and calculates probabilities based on language
models to narrow the list of candidate words. Compared to a traditional code completion system,
an intelligent one more effectively enhances developer productivity.
        </p>
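        <p>As a toy illustration of how language-model probabilities can rank completion candidates (a bigram count model standing in for the N-gram and RNN approaches above; the corpus and function names below are ours, not from any cited system), consider:</p>
        <preformat>
```python
from collections import Counter, defaultdict

# Toy corpus of tokenized Dockerfile lines (invented for illustration).
corpus = [
    ["FROM", "ubuntu", "RUN", "apt-get", "update"],
    ["FROM", "ubuntu", "RUN", "apt-get", "install"],
    ["FROM", "centos", "RUN", "dnf", "install"],
]

# Count bigram frequencies: P(next | prev) is proportional to count(prev, next).
follows = defaultdict(Counter)
for tokens in corpus:
    for prev, nxt in zip(tokens, tokens[1:]):
        follows[prev][nxt] += 1

def complete(prev_token, k=3):
    """Return the k most probable next tokens after prev_token, with probabilities."""
    counts = follows[prev_token]
    total = sum(counts.values())
    return [(tok, c / total) for tok, c in counts.most_common(k)]

print(complete("RUN"))  # apt-get ranks above dnf in this corpus
```
        </preformat>
        <p>Real intelligent completion systems condition on a longer context than a single previous token, but the ranking principle is the same.</p>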
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Docker, infrastructure as code, and challenges</title>
        <p>
          Docker, an open containerization platform, isolates applications from the development
environment with containers, allowing efficient resource utilization. Docker has become the de facto
standard container technology; over 87% of information technology companies use Docker [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>
          Containers in Docker can be built by interactively executing commands or by creating
configuration files called Dockerfiles. Dockerfiles set up containers through imperative instructions,
enabling reproducible builds. A process for specifying the environment in which software systems
will be tested and/or deployed by configuration scripts is called IaC. Developers can manage
infrastructure configuration in the same way as application code, allowing automated scaling
and the prevention of human error [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Interest in IaC has thus grown [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>
          Research on IaC is still in its infancy [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. There are relatively few studies on IaC, and most
of them propose tools for implementing the practices of IaC itself. Knowledge in software
engineering, such as that on development support and static analysis, can be applied to IaC.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Humpback: code completion system for Dockerfiles</title>
      <sec id="sec-3-1">
        <title>3.1. System overview</title>
        <p>We propose Humpback, a code completion system for Dockerfiles. Humpback helps developers
to reduce errors and enhance efficiency when writing Dockerfiles. Various methods have been
used to implement code completion systems; here, we employ language models. Statistically
processing pre-collected Dockerfiles and performing contextual predictions makes it possible to
reuse existing knowledge. We also introduce model switching to overcome the problem caused
by base image differences.</p>
        <p>[Figure 2: Overview of the learning phase. Dockerfiles are collected via the GitHub API, their contents are divided and converted into input/expected-output pairs (e.g., FROM centos, FROM centos RUN, FROM centos RUN dnf), training data are produced, and a language model is generated with LSTM.]</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Methodology</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. Learning phase</title>
          <p>The methodology of Humpback is divided into the learning phase and the prediction phase.
The learning phase includes file collection, data processing, and language model generation.
Figure 2 shows an overview of the learning phase.</p>
          <p>File collection: We search for repositories with Dockerfiles using the GitHub API 3, pull
these repositories in order of their star count (i.e., popularity), and extract the Dockerfiles.</p>
          <p>Data processing: The contents of the collected Dockerfiles are divided into token
sequences. The inputs are paired with the expected outputs. For example, if there is a statement
FROM centos RUN dnf, then FROM expects centos and FROM centos expects RUN. Next,
these data are encoded using integer values for the learner to interpret efficiently. The number
of elements in the training data varies. Therefore, 0-padding is performed to obtain fixed-length
data.</p>
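          <p>A minimal sketch of this data processing step (function and variable names are ours, not Humpback's):</p>
          <preformat>
```python
# Build (input, expected-output) pairs from one Dockerfile's token sequence,
# encode tokens as integers, and zero-pad inputs to a fixed length.
def make_training_data(tokens, max_len):
    vocab = {}

    def encode(tok):
        # Reserve 0 for padding; assign integer ids from 1 upward.
        if tok not in vocab:
            vocab[tok] = len(vocab) + 1
        return vocab[tok]

    ids = [encode(t) for t in tokens]
    pairs = []
    for i in range(1, len(ids)):
        prefix = ids[:i]
        # Left-pad with zeros so every input has exactly max_len elements.
        padded = [0] * (max_len - len(prefix)) + prefix
        pairs.append((padded, ids[i]))
    return pairs, vocab

pairs, vocab = make_training_data(["FROM", "centos", "RUN", "dnf"], max_len=3)
# FROM expects centos, FROM centos expects RUN, FROM centos RUN expects dnf
```
          </preformat>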
          <p>
            Language model generation: Humpback uses language models for prediction. We assume
that the contents of Dockerfiles are time-series data. LSTM [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], an improved RNN architecture
used in the field of deep learning, is employed to generate language models. The middle layer
of the RNN is replaced with LSTM blocks, which allow for learning with long-term dependency.
          </p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Prediction phase</title>
          <p>Humpback uses model switching to overcome the problem caused by base image differences.
Pre-trained language models for each Linux distribution are prepared in advance. Humpback
switches models for prediction depending on the input data. For instance, if the base image of
input data is Ubuntu, a model trained with Dockerfiles whose base images are Ubuntu is used.</p>
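          <p>Conceptually, model switching is a lookup from the detected distribution to a pre-trained model, with a generic model as a fallback (a sketch with stand-in objects; Humpback's actual model interface is not described in this paper):</p>
          <preformat>
```python
# Stand-in for a trained language model; the real models are LSTM networks.
class StubModel:
    def __init__(self, name):
        self.name = name

    def predict(self, tokens):
        return f"prediction from {self.name} model"

# One pre-trained model per Linux distribution, plus a generic fallback.
models = {
    "ubuntu": StubModel("ubuntu"),
    "debian": StubModel("debian"),
    "alpine": StubModel("alpine"),
}
generic = StubModel("generic")

def predict_with_switching(distribution, tokens):
    # Switch to the distribution-specific model when one exists.
    return models.get(distribution, generic).predict(tokens)

print(predict_with_switching("ubuntu", ["FROM", "ubuntu", "RUN"]))
```
          </preformat>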
          <p>However, it is impossible to identify the Linux distribution from the base image name
in some cases. For example, we can guess that “openjdk:11-jdk” will include the Java
development environment, but cannot guess its Linux distribution. We created a base image
detector to determine the Linux distribution for a given Dockerfile. First, the base image
detector builds a container from the Dockerfile. Then, it identifies the distribution based on
the /etc/os-release file. We analyzed the base images of the entire dataset (section 4.2).
With these results, Humpback can switch models for prediction even if the Linux distribution
is not explicitly specified. For example, the base image detector identified the distribution of
openjdk:11-jdk as Debian.</p>
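          <p>The distribution lookup can be sketched as follows (the container build step is omitted; we only parse the ID field of an /etc/os-release file, and the helper name is ours):</p>
          <preformat>
```python
def parse_os_release(text):
    """Extract the ID field from the contents of an /etc/os-release file."""
    for line in text.splitlines():
        if line.startswith("ID="):
            # Values may be quoted, e.g. ID="centos".
            return line[3:].strip().strip('"')
    return None

sample = 'PRETTY_NAME="Debian GNU/Linux 10 (buster)"\nID=debian\n'
print(parse_os_release(sample))  # debian
```
          </preformat>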
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Implementation</title>
        <p>Three libraries/frameworks are used to implement Humpback, namely TensorFlow4, a software
library for machine learning, Keras5, a high-level neural network library, and Optuna6, a
hyperparameter auto-optimization framework. Candidate words are presented immediately,
so developers can use Humpback without slowing down their development process.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation Experiment</title>
      <sec id="sec-4-1">
        <title>4.1. Evaluation metrics</title>
        <p>
          We conducted evaluation experiments to verify that model switching improves the accuracy of
code completion. Top-k accuracy (Acc(k)) and the mean reciprocal rank (MRR) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] are used
as metrics for evaluating accuracy:
        </p>
        <p>Acc(k) = N_top-k / |Q| and MRR = (1/|Q|) ∑_{i=1}^{|Q|} 1/rank_i, where N_top-k refers to the number of relevant
recommendations in the top k suggestions, |Q| represents the total number of queries, and rank_i
denotes the rank position of the first relevant word for the i-th query. For both Acc(k) and
MRR, a value closer to 1 indicates better model performance.</p>
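        <p>Both metrics can be computed directly from ranked candidate lists and correct answers (the helper names and sample data below are ours):</p>
        <preformat>
```python
def topk_accuracy(ranked_lists, answers, k):
    # Fraction of queries whose correct answer appears in the top k suggestions.
    hits = sum(1 for preds, ans in zip(ranked_lists, answers) if ans in preds[:k])
    return hits / len(answers)

def mean_reciprocal_rank(ranked_lists, answers):
    # Average of 1/rank of the first relevant word; a query contributes 0
    # if the answer never appears in its candidate list.
    total = 0.0
    for preds, ans in zip(ranked_lists, answers):
        if ans in preds:
            total += 1.0 / (preds.index(ans) + 1)
    return total / len(answers)

preds = [["RUN", "CMD"], ["apt-get", "dnf"], ["dnf", "yum"]]
gold = ["RUN", "dnf", "yum"]
print(topk_accuracy(preds, gold, 1))      # 1 of 3 answers is at rank 1
print(mean_reciprocal_rank(preds, gold))  # averages ranks 1, 2, and 2
```
        </preformat>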
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Dataset</title>
        <p>We collected 21,190 Dockerfiles using the GitHub API. The numbers of Dockerfiles and their
versions for various Linux distributions are shown on the left side of Table 1. The major
distributions in the dataset are Alpine Linux, Debian GNU/Linux, and Ubuntu. The dataset
for Ubuntu has the most variety, with 19 versions in 1,497 files. In the table, “Others” includes
Amazon Linux, CentOS, Fedora, Oracle Linux Server, and VMware Photon OS/Linux.</p>
        <p>4https://www.tensorflow.org/ 5https://keras.io/ 6https://preferred.jp/en/projects/optuna/</p>
        <p>The number of epochs and the learning duration are shown on the right side of Table 1.
Hyperparameters such as the activation/optimization function, and number of units in each
layer were optimized using Optuna.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experiment design</title>
        <p>There were three axes of comparison: the presence or absence of model switching, the Linux
distribution, and the syntax. We compared the recommendation accuracy for the three major
distributions in the dataset, both with and without model switching. For the case without model
switching, we created a generic model that was trained with all Dockerfiles. Two syntaxes were
distinguished: descriptions in the RUN instruction were classified as Shell syntax, and all other
descriptions as Docker syntax.</p>
        <p>We first extracted 100 Dockerfiles from the dataset and set the correct answer to a random
position in each Dockerfile. Next, the contents from the beginning of the file to just before
the correct answer were given to the language models and predictions were generated. Then
() and   were computed by comparing the predictions against the correct answer.
Ten rounds of the above process were performed for each comparison axis.</p>
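        <p>The per-file evaluation step can be sketched as follows (the stand-in predictor and all names are ours; the real predictions come from the trained language models):</p>
        <preformat>
```python
import random

def evaluate_at(tokens, pos, predict, k):
    """Predict the token at pos from the preceding context; return
    (hit_in_top_k, rank), where rank is None if the answer is absent."""
    context, answer = tokens[:pos], tokens[pos]
    ranked = predict(context)  # ranked candidate list from the model
    rank = ranked.index(answer) + 1 if answer in ranked else None
    hit = rank is not None and rank in range(1, k + 1)
    return hit, rank

def evaluate_file(tokens, predict, k):
    # The correct answer is set at a random position in each Dockerfile.
    return evaluate_at(tokens, random.randrange(1, len(tokens)), predict, k)

# Stand-in predictor returning a fixed ranked list (for illustration only).
always = lambda context: ["RUN", "apt-get", "FROM"]
print(evaluate_at(["FROM", "ubuntu", "RUN", "apt-get"], 2, always, k=1))  # (True, 1)
```
        </preformat>
        <p>Averaging the hits gives Acc(k), and averaging the reciprocal ranks gives MRR.</p>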
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Experiment results</title>
        <p>Table 2 shows the average scores of Acc(1), Acc(5), and MRR. “Gen.” refers to the generic
model (i.e., without model switching). “Hump.” refers to Humpback (i.e., with model switching).
The numbers in bold indicate the best scores in a given category.</p>
        <p>Prediction with Humpback is more accurate for almost all evaluation axes. Model switching
is thus beneficial for building Docker-specific code completion systems. Humpback achieved an
outstanding average Top-1 accuracy of 96.9% (up to 98.8% for Debian, Docker syntax). Moreover,
the accuracy improved by up to 5.0% (Ubuntu, Docker syntax) compared to that for the generic
model. As described in section 3.3, candidate words are presented instantly. With this
speed and high accuracy, Humpback can significantly improve productivity.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this study, we proposed Humpback, a code completion system for Dockerfiles. Humpback is
available online and can be used in a web browser. We introduced model switching to overcome
a Docker-specific problem. Evaluation experiments showed that Humpback has a high average
accuracy of 96.9%, and that model switching improves the accuracy of Humpback. In future
work, we will further improve the accuracy of Humpback and compare Humpback with other
code completion systems.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported in part by MEXT/JSPS KAKENHI Grant No. 18H03222.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Henkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bird</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Lahiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Reps</surname>
          </string-name>
          ,
          <article-title>A dataset of dockerfiles</article-title>
          ,
          <source>in: International Working Conference on Mining Software Repositories</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Portworx</surname>
          </string-name>
          ,
          <article-title>Container adoption survey</article-title>
          ,
          <year>2019</year>
          . URL: https://portworx.com/wp-content/uploads/2019/05/2019-container-adoption-survey.pdf
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Artac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Borovssak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. Di</given-names>
            <surname>Nitto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guerriero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Tamburri</surname>
          </string-name>
          ,
          <article-title>DevOps: Introducing infrastructure-as-code</article-title>
          ,
          <source>in: International Conference on Software Engineering Companion</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>497</fpage>
          -
          <lpage>498</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mahdavi-Hezaveh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <article-title>A systematic mapping study of infrastructure as code research</article-title>
          ,
          <source>Information and Software Technology</source>
          <volume>108</volume>
          (
          <year>2019</year>
          )
          <fpage>65</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bruch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Monperrus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mezini</surname>
          </string-name>
          ,
          <article-title>Learning from examples to improve code completion systems</article-title>
          ,
          <source>in: European Software Engineering Conference and Symposium on the Foundations of Software Engineering</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>213</fpage>
          -
          <lpage>222</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Gers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cummins</surname>
          </string-name>
          ,
          <article-title>Learning to forget: continual prediction with LSTM</article-title>
          ,
          <source>in: International Conference on Artificial Neural Networks</source>
          , volume
          <volume>2</volume>
          ,
          <year>1999</year>
          , pp.
          <fpage>850</fpage>
          -
          <lpage>855</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Svyatkovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sundaresan</surname>
          </string-name>
          ,
          <article-title>Pythia: AI-assisted code completion system</article-title>
          ,
          <source>in: International Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2727</fpage>
          -
          <lpage>2735</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Parnin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Helms</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Atlee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Boughton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghattas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Glover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Holman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Micco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Savor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stumm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Whitaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <article-title>The top 10 adages in continuous deployment</article-title>
          ,
          <source>IEEE Software 34</source>
          (
          <year>2017</year>
          )
          <fpage>86</fpage>
          -
          <lpage>95</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Radev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <article-title>Evaluating web-based question answering systems</article-title>
          ,
          <source>in: International Conference on Language Resources and Evaluation</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>1153</fpage>
          -
          <lpage>1156</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>