-

Towards reproducibility of computational environments for Scienti c Experiments using Container-based virtualization.

Maximiliano Osorio

Carlos Buil-Aranda

Hernan Vargas

hvargasg@inf.utfsm.cl 0 0 Informatics Department, Universidad Tecnica Federico Santa Mar a , Chile

Experiment reproducibility is the ability to run an experiment with the introduction of changes to it and getting results that are consistent with the original ones. To allow reproducibility, the scienti c community encourages researchers to publish descriptions of the these experiments. However, these recommendations do not include an automated way for creating such descriptions: normally scientists have to annotate their experiments in a semi automated way. In this paper we propose a system to automatically describe computational environments used in in-silico experiments. We propose to use Operating System (OS) virtualization (containerization) for distributing software experiments throughout software images and an annotation system that will allow to describe these software images. The images are a minimal version of an OS (container) that allow the deployment of multiple isolated software packages within it.

Experiment reproducibility is the ability to run an experiment with the introduction of changes to it and getting results that are consistent with the original ones. Introducing changes allows to evaluate di erent experimental features of that experiment since researchers can incrementally modify it, improving and repurposing the experimental methods and conditions [ 12 ]. To allow experiment reproducibility it is necessary to provide enough information about that experiment, allowing to understand, evaluate and build it again. Usually, experiments are described in scienti c work ows (representations that allow managing large scale computations) which run on distributed computing systems. To allow reproducibility of these scienti c work ows it is necessary rst to address a workow conservation problem, since experimental work ows need to guarantee that there is enough information about the experiments so it is possible to build them again by a third party, replicating its results without any additional information from the original author [ 7 ].

To achieve conservation the research community has focused on conserving work ow executions by conserving data, code, and the work ow description, but not the underlying infrastructure (i.e. computational resources and software components). There are some approaches that that focused on conserving the environment of an experiment such as the work in [ 10 ] or the Timbus project1 [ 4 ] that focuses on business processes and the underlying software and hardware infrastructure. The authors in [ 10 ] identi ed two approaches for conserving the environment of a scienti c experiment: physical conservation, where the research objects within the experiment are conserved in a virtual environment; and logical conservation, where the main capabilities of resources in the environment are described using semantic vocabularies to allow a researcher to reproduce an equivalent setting. The authors de ned a process for documenting the work ow application and its related management system, as well as their dependencies. However this process is done in a semi-automated manner, leaving much work left to the scientists. Furthermore, usually most works leave out of the scope the physical conservation of the execution of scienti c work ows. In this paper we approach to both, physical and logical conservation problems.

Herein we propose to improve the physical conservation solution by using operating-system-level virtualization. This technology, also known as containerization, refers to an Operating System (OS) feature in which the OS kernel allows the existence of multiple isolated user-space instances called containers. One of the most popular virtualization technologies is Docker2, which implements software virtualization by creating minimal versions of a base operating system (a container). Docker Containers can be seen as lightweight virtual machines that allow the assembling of a computational environment, including all necessary dependencies, e.g., libraries, con guration, code and data needed, among others. Docker distributes this computational environment through software images. The main problem for conserving the physical environment of an experiment is the amount of space needed. With container virtualization it is possible to reduce to the minimum the size of the virtual machine needed for running the scienti c work ow. We propose rst to use Docker images as means for preserving the physical environment of an experiment. We use containers since they are lightweight and more importantly, they are easier to automatically describe so we improve the process of documenting scienti c work ows. We do that by describing the work ow management system, as well as their dependencies by developing an annotator system for the Docker images before.

In the paper, we present a new system that scans the DockerHub portal for Docker images and annotates them, achieving logical conservation by using container-based virtualization. We present this as a Proof-of-Work that will allow to conserve the logical environment of an experiment. We assume that the physical environment is already preserved at the Docker images we describe semantically. 2

Docker Overview

Docker is a technology that allows virtualizing a minimal version of an Operating System. Therefore users can run applications within it. Throughout this section, 1 http://www.timbusproject.net/ 2 https://www.docker.com/ we introduce how Docker and its registry (Docker Hub) work, starting with how Docker images are created and stored in Docker Hub. 2.1

Docker repositories and les

Docker builds a software image by reading a set of instructions from a Docker le. A Docker le is a text le that contains all commands to build a Docker image. Docker les usually have multiple lines, which are translated into image layers whereas Docker builds the image. In the build process, the command is executed sequentially, creating one layer after the other. When an image is updated or rebuilt, only modi ed layers (i.e., modi ed lines) are updated. 2.2

Publishing and Deploying Docker images from Docker Hub

Docker Hub is an online registry that stores two types of public repositories, both o cial, and community. O cial repositories contain public, veri ed images such as Canonical, Nginx, Red Hat, and Docker. At the same time, community repositories can be public or private and are created by any user or organization. By using that registry and a command line, it is possible to download and deploy Docker images locally as a running container into a host executing thus the software within the image. Anyone has the chance to create and store images into the Docker Hub registry by rst creating a descriptor le called Docker le. This descriptor describes what software packages will be within the image, builds the image and nally uploads it to Docker Hub. However, Docker Hub does not control what packages are in the images, whether the image will deploy correctly or the images might have any security problem. Thus, Docker images work as a black box, which refers that users know that the main software package runs within the container but they do not know the other packages needed to run it.

There are two ways of uploading images to a user repository, either by a push from a local host or automating that process from a Github repository. In order to push a repository to the Docker Hub, the users need to name their local images using their Docker Hub username, and the repository name they had created. Afterwards, users add multiple images to a repository by adding a speci c :<tag> to it. This is all the information that normally Docker images have, being thus almost impossible to reproduce the execution environment if any of the used software packages within the images is modi ed. 3

DockerPedia Resource

In this section, we describe how we automatically annotate the software images stored at the DockerHub. Once these images are annotated we claim that we can obtain the execution environment of a scienti c experiment by using a single SPARQL query. We store the descriptions generated at our RDF database called DockerPedia (available at https://dockerpedia.inf.utfsm.cl/). This RDF database is available as a 5-star linked data resource. 3.1

DockerPedia Annotator

We rst describe the images at DockerHub by using the DockerPedia Annotator. This tool extracts and annotates the software components, their version and metadata of a software image automatically. DockerPedia Annotator relies on four components: DockerPedia API, Metadata Extractor, Feature Extractor and Clair. First, DockerPedia-API receives the user environment and sent it to two components: Metadata Extractor and Clair. The Metadata Extractor extracts metadata from Docker le and DockerHub and saves the data on Postgres through DockerPedia API. In parallel, Clair downloads all layers from an image as a lesystem, mount it and analyzes it, determining the layer's operating system, and the packages added and removed from the layer. Finally, the Feature Extractor organizes the image layers and computes the resultant packages.

The Figure 1 shows an overview of the process.

Having the Docker images semantically described it is possible to obtain their descriptions all the packages (including their versions) and dependencies by using a SPARQL query. Notice that this is not possible by just looking at the Docker le from the original image (assuming it exists such le) since the Docker le only shows the main packages to be installed (i.e. through commands like apt-get install package name). In Listing 1.1 we show all packages that are installed within the TensorFlow Docker image:

The results from the query in Listing 1.1 are the 183 packages needed to run the TensorFlow image at DockerHub. With this data available, a scientist developing a TensorFlow algorithm today would know exactly what software packages she needs to run again that algorithm in the future, replicating the exact same execution environment. 3.2

The Annotation Process

To annotate the Docker images we need rst to obtain them from the Docker Hub. In February 2018, this search returned 1,363,510 Docker repositories and 4,608,443 images composed of 4,593,602 community images and 14,841 o cial images. The total size of these images is 53.47 PB. Our machines do not have the computing power nor the network requirements needed to perform the analysis to that amount of data. Therefore, we only analyzed the 160,000 most downloaded Docker images from Docker Hub. To perform that analysis we used seven virtual machines with following speci cations: 2 CPU (2.20GHz) and 3 GB of memory from Digital Ocean.

Once we discovered the name of the images to analyze, we obtain from the Docker Hub the Username, repository name, image description, whether the image is an automated build or not, when the image was updated for the last time, number of pulls and the number of stars of that image as well. We also obtain all versions of each image (which is called \tag" by the Docker Hub). For instance, the Docker repository \google/cadvisor" has 59 di erent images of the original software, and each image has di erent packages. By each of the tag, we obtain its name when was updated for the last time and its size.

After that, we use the tool from the Clair project3 to detect the packages and vulnerabilities within each downloaded image. Clair is an open-source tool from CoreOS designed to identify known vulnerabilities in Docker images. Clair has been primarily used to scan images in CoreOS is private container registry, Quay.io4, but it can also analyze Docker images. Clair downloads all layers from an image as a le system, mounts it and analyzes it, determining the layer's operating system, and the packages added and removed from the layer. Finally, the result of the analysis is the following: a list of the installed packages in the image, the layers associated with this image and the relationship between them. To store the data from the analysis Clair uses a relational database. However, this database is not available on the Web. Thus it does not comply with any of the ve stars for data publishing [ 2 ]. 3.3

The Docker Ontology

To publish all these data from the previous step in the form of a knowledge graph and link it to the Debian package RDF store we used Morph [ 9 ], a Rela3 https://coreos.com/clair/ 4 http://status.quay.io tional Database to RDF (RDB2RDF [ 5 ]) engine. Since the RDF2RDF process needs an ontology, we developed a lightweight ontology to shape the relations between the di erent concepts we want to publish. The ontology rst imports abstract classes from well-known ontologies and relations from the Docker Ontology [ 8 ] for some of the Docker concepts and the WICUS ontology [ 10 ] for the software experiment reproducibility concepts. A scienti c work ow requires a stack of software components, and the researchers must know how to deploy this software stack to achieve an equivalent environment. Thus, we import some abstract classes, and relations from WICUS ontology: (1) SoftwareStack describes the software components that must be installed and their dependencies, (2) DeploymentPlan, DeploymentStep, Con gurationInfo and Con gurationParameter classes describe the steps to deploy and con gure the software.

Our nal ontology is depicted in Figure 2.

The main classes in ontology are SoftwareImage (image from which a container is deployed), ImageLayer (a line within the Docker le that install the software within the container), the software packages installed at the ImageLayer (including security vulnerabilities and version packages) and the DockerFile class. DockerPedia resources are identi ed by the URI http://dockerpedia.inf.utfsm.cl/resource while the vocabulary is identied by the URI http://dockerpedia.inf.utfsm.cl/ontology. In Listing 1.2 we show the RDF data about the Google's \cadvisor" within Docker.

Related Work

Previously [ 11 ] analyzed the security of Docker images from a higher level point of view. However, this work has not published the data from the image analysis (that data mainly includes security vulnerability issues). Other work such as [ 6 ], have pointed the di culty of reproducing scienti c contributions due to their high dependence with the computational environment. The authors propose the use of Docker Containers to allow reviewers and future researchers to reproduce the experiments within the same environment. However, they do not provide any means to perform it, and they only express desiderata. Similarly, a di erent approach in which software ontologies are used is to allow software experiment reproducibility [ 10 ]. In that work, the authors present a set of ontologies that model software and hardware components to allow the experiment reproducibility. In [ 3 ] the author describes a set of good practices to use Docker for reproducing experiments. One of them is to archive tarball snapshots pointing that Docker can rollback layers that have been added to an image, but not revert to the earlier state of a particular layer. We allow the user to revert to that state by providing exact descriptions of each layer. Also, the project Timbus [ 1 ] address several issues including the computation environment conservation by means of an annotator. For this purpose, they proposed an extractor to extract and annotate the Software and Hardware components. Nevertheless, the extractor approach of Timbus Project is not adequate to be used in Containers since it increases the complexity of the container. Another related project is Common Work ow Language (CWL, https://www.commonwl.org/), CWL is a speci cation to describe analysis work ows and tools, as a result, this descriptions are portable and scalable over a variety of software and hardware environments including Docker Containers. However, CWL does not deal with the tasks of describing the resources that de ne the environment. The tool MyBinder (https://mybinder.org/) allows to generate Docker images from GitHub repositories containing the dependency les from that GitHub repository, however the tool does not describe them. Finally [ 13 ] describes how to use RDF to represent Docker les. 5

Conclusions

Throughout this paper, we have presented a resource that allows users to analyze Docker images before they run them into their hosts. We have gathered all (4.5 million) Docker images stored at DockerHub (5TB of data) and analyzed 150,000 of them. The analysis resulted in a dataset with more than 140 million triples, storing data about Docker images, software packages, and their vulnerabilities, links to the Debian package resource description, and the vulnerabilities information pages.

First, we propose a method to conduct logical and physical conservation of the computational environment of an experiment using Docker containers. Second, we achieve logical conservation without spending the considerable amount of e ort annotating the software components and their dependencies. Finally, we rely the physical conservation on DockerHub and their lightweight Docker images.

Barateiro ,

Draws ,

M. A.

Neuman , and

Strodl . Digital preservation challenges on software life cycle . In Software Maintenance and Reengineering (CSMR) , 2012 16th European Conference on, pages 487 { 490 . IEEE, 2012 .

2. T. Berners-Lee. Is your linked open data 5 star. Repere a https://www . w3. org/DesignIssues/LinkedData. html, 2010 .

Boettiger . An introduction to docker for reproducible research . SIGOPS Oper. Syst. Rev. , 49 ( 1 ): 71 { 79 , Jan . 2015 .

Dappert ,

Peyrard ,

C. C.

Chou , and

Delve . Describing and preserving digital object environments . New Review of Information Networking , 18 ( 2 ): 106 { 173 , 2013 .

Das ,

Sundara , and

Cyganiak . R2rml: Rdb to rdf mapping language . w3c rdb2rdf working group , 2012 .

Di Tommaso , E. Palumbo,

Chatzou ,

Prieto ,

M. L.

Heuer , and

Notredame . The impact of docker containers on the performance of genomic pipelines . PeerJ , 3 : e1273 , 2015 .

Garijo ,

Kinnings ,

Xie ,

Zhang ,

P. E.

Bourne , and

Gil . Quantifying reproducibility in computational biology: the case of the tuberculosis drugome . PloS one , 8 ( 11 ): e80278 , 2013 .

Huo ,

Nabrzyski , and

Vardeman . Smart container: an ontology towards conceptualizing docker . In International Semantic Web Conference (Posters & Demos) , 2015 .

Priyatna ,

Corcho , and

Sequeda . Formalisation and experiences of R2RMLbased SPARQL to SQL query translation using MORPH . In Proceedings of the 23rd international conference on World wide web , pages 479 { 490 . ACM, 2014 .

10. I. Santana-Perez , R. F. da Silva , M.

Rynge , E. Deelman, M. S.

Perez-Hernandez , and O.

Corcho . Reproducibility of execution environments in computational science using semantics and clouds . Future Generation Computer Systems , 67 : 354 { 367 , 2017 .

11.

Shu ,

Gu , and

Enck . A Study of Security Vulnerabilities on Docker Hub . In Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy , CODASPY '17 , pages 269 { 280 , New York, NY, USA, 2017 . ACM.

12. V. C. Stodden . Reproducible research: Addressing the need for data and code sharing in computational science . Computing in science & engineering, 12 (5):8 { 12 , 2010 .

13.

Tommasini , B. De Meester,

Heyvaert ,

Verborgh , E. Mannens, and

E. Della

Valle . Representing docker les in RDF . In ISWC2017, the 16e International Semantic Web Conference , volume 1931 , pages 1{4 , 2017 .