DockerPedia: a Knowledge Graph of Docker
                     Images

         Maximiliano Osorio, Carlos Buil-Aranda, and Hernán Vargas

     Informatics Department, Universidad Técnica Federico Santa Marı́a, Chile
                   {mosorio, cbuil, hvargas}@inf.utfsm.cl


      Abstract. Docker is the most popular implementation of operating-system-level
      virtualization, and its online registry service (Docker Hub) currently
      stores more than 4.5 million software images. Using that registry, it is
      possible to download and deploy Docker images as software containers.
      However, these images only expose information about their main software,
      hiding the dependencies needed to run it. To allow users to track what
      they deploy on their machines, we developed DockerPedia, a resource
      that publishes information about the packages within Docker images as
      Linked Data. Currently our resource includes 28% of the most downloaded
      images from Docker Hub, providing information about their software
      dependencies and vulnerabilities. This makes it easy to reproduce
      the environment in which each image was deployed and to check
      the security of an image without downloading it.


1   Introduction
One of the most common ways to distribute and deploy software packages easily
is operating-system-level virtualization. This technology, also known as
containerization, refers to an Operating System (OS) feature in which the
kernel allows multiple isolated user-space instances, called containers, to
coexist. The most popular virtualization technology is Docker
(https://www.docker.com/), which implements software virtualization by
creating minimal versions of a base operating system (a container). Docker
containers can be seen as lightweight virtual machines that allow the assembly
of a computational environment, including all necessary dependencies
(libraries, configuration, data, etc.). Docker distributes these computational
environments as software images, allowing the user to easily deploy a server
(e.g., Virtuoso), generate reproducible experiments (as in [2]) or even share
movies and music. Public and private Docker images can be stored in Docker Hub
(https://hub.docker.com/). Using this registry, it is possible to download and
locally deploy Docker images, thus executing the software within them. However,
Docker Hub does not provide information about which packages an image contains,
whether it will deploy correctly or whether it has any security issues. Thus,
Docker images work as a black box: users know the main software package that
runs within the container, but they have no information about the other
packages included as its dependencies.
   With DockerPedia we aim to shed some light on the black box of running
Docker containers. We analyzed 28% of the most downloaded images from Docker
Hub, semantically described them and published these descriptions as Linked
Data. The descriptions include information about the packages (name, version,
date of installation) and a security analysis of them. We linked all this data
to the Debian public RDF package database (see footnote 1) and to several
public software vulnerability databases. The information provided by
DockerPedia allows users to know the exact packages used in a Docker image and
their vulnerabilities and, if necessary, to reproduce the execution
environment without the image itself.


2     Docker Overview
Docker builds software images by reading a set of instructions from a
Dockerfile (a text file containing a set of commands). These files normally
have multiple lines, each of which is translated into an image layer when
Docker builds the image. In this process, commands are executed sequentially,
creating one layer after another. This way, when an image is updated or
rebuilt, only modified layers (i.e., modified lines) are updated.
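
The line-to-layer correspondence described above can be illustrated with a minimal Python sketch. It assumes a simplified Dockerfile syntax (comments starting with `#`, and a trailing backslash as the only continuation mechanism), and it follows the description in this section by treating every instruction as one layer. The function name `split_instructions` is hypothetical and is not part of Docker or DockerPedia.

```python
def split_instructions(dockerfile_text: str) -> list[str]:
    """Return one string per Dockerfile instruction, in build order.

    Simplifying assumption: each returned instruction stands for one
    image layer, as described in the text above.
    """
    instructions = []
    pending = ""
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        if line.endswith("\\"):
            # A trailing backslash continues the instruction on the next line.
            pending += line[:-1].strip() + " "
            continue
        instructions.append(pending + line)
        pending = ""
    if pending:
        # The file ended in the middle of a continued instruction.
        instructions.append(pending.strip())
    return instructions


if __name__ == "__main__":
    # Hypothetical Dockerfile, for illustration only.
    example = "FROM debian:stretch\nRUN apt-get update && \\\n    apt-get install -y curl\n"
    for index, instruction in enumerate(split_instructions(example)):
        print(f"layer {index}: {instruction}")
```

Numbering the instructions this way gives a simple handle for comparing two versions of the same file line by line.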

The Docker Hub. Docker Hub is an online registry for Docker image repositories.
It currently stores more than 4.5 million images in two types of repositories:
official and community. Official repositories contain verified images such as
Nginx, Red Hat and Docker. Community repositories are not verified and can be
created by any user or organization. Users can upload images to Docker Hub from
other services such as GitHub or by pushing their local images with the
command-line interface that Docker provides. Multiple images can be uploaded to
the same repository by giving each a different tag. In general, this attribute
specifies the image version, and the most recent image is labeled latest.
Uploaded images are available to members of the same organization or to the
whole community.
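
To make the repository and tag convention concrete, the following Python sketch splits an image reference of the form `repository[:tag]` and falls back to `latest` when no tag is given. The function name `split_reference` is hypothetical, and the sketch deliberately ignores digest references (`@sha256:...`).

```python
def split_reference(reference: str) -> tuple[str, str]:
    """Split 'repository[:tag]' into (repository, tag).

    When the tag is omitted, 'latest' is assumed.
    """
    # Only a colon after the last slash separates a tag; an earlier colon
    # would belong to a registry host and port (e.g. 'localhost:5000/app').
    last_segment = reference.rsplit("/", 1)[-1]
    if ":" in last_segment:
        repository, tag = reference.rsplit(":", 1)
        return repository, tag
    return reference, "latest"


if __name__ == "__main__":
    # 'v1' is a made-up tag used only for illustration.
    for ref in ("google/cadvisor", "google/cadvisor:v1"):
        print(ref, "->", split_reference(ref))
```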


3     Docker Image Analysis
To extract information from the Docker images hosted in Docker Hub, we
queried its free-text search box to obtain all the Docker images. In
February 2018, this search returned 1,363,510 Docker repositories and 4,608,443
images, composed of 4,593,602 community images and 14,841 official images. The
total size of these images is 53.47 PB. Because examining such an amount of
data requires a large amount of computational resources, we analyzed only the
28% most downloaded Docker images from Docker Hub.
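
The selection step can be sketched as follows. The exact procedure used to pick the 28% is not specified here, so this is only one plausible reading: rank images by pull count and keep the top fraction, rounding up. The function name and the sample data are hypothetical. The last lines check that the reported community and official image counts add up to the reported total.

```python
import math


def most_downloaded(pulls_by_image: dict[str, int], fraction: float) -> list[str]:
    """Return the names of the top `fraction` of images ranked by pull count."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    # Sort image names from most to least pulled.
    ranked = sorted(pulls_by_image, key=pulls_by_image.get, reverse=True)
    # Assumption: round the cut-off up so that at least one image is kept.
    keep = math.ceil(len(ranked) * fraction)
    return ranked[:keep]


if __name__ == "__main__":
    # Made-up pull counts, for illustration only.
    sample = {"a": 10, "b": 500, "c": 20, "d": 300}
    print(most_downloaded(sample, 0.28))

    # Consistency check of the figures reported in the text.
    community, official, total = 4_593_602, 14_841, 4_608_443
    print(community + official == total)
```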
    For each image we obtain the following information: image name, user name,
repository name, image description, last update date, number of pulls
(downloads) and number of stars (user ratings). Note that we retrieve all the
versions of each image (i.e., all tags in each analyzed repository) and the
information that these include, such as the tag name, last update date and
image size. For instance, the Docker repository “google/cadvisor” has 59
different images of the same software, and each of these has different
packages. We deploy the images to extract information about the packages
installed in them and, using the Clair software (see footnote 2), we detect
the vulnerabilities of each package. Clair stores the data from the security
analysis in a relational database. However, this database is not available on
the Web, and thus it does not comply with any of the 5 stars for data
publishing [1]. DockerPedia publishes this data, together with the information
gathered from Docker Hub, on the Web as Linked Data.

1
    https://packages.qa.debian.org/common/RDF.html
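
The per-image metadata listed in this section could be held in a structure like the one below. This is a minimal sketch: the class and field names are hypothetical and do not reflect the actual DockerPedia schema, and all values in the usage example are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class TagRecord:
    """One tagged version of an image within a repository."""
    name: str
    last_updated: str  # date kept as text, e.g. ISO 8601
    size_bytes: int


@dataclass
class ImageRecord:
    """Metadata collected from Docker Hub for one repository."""
    image_name: str
    user_name: str
    repository_name: str
    description: str
    last_updated: str
    pulls: int
    stars: int
    tags: list[TagRecord] = field(default_factory=list)


if __name__ == "__main__":
    # Placeholder values; only the repository name comes from the text.
    record = ImageRecord(
        image_name="cadvisor",
        user_name="google",
        repository_name="google/cadvisor",
        description="",
        last_updated="2018-02-01",
        pulls=0,
        stars=0,
    )
    record.tags.append(TagRecord(name="latest", last_updated="2018-02-01", size_bytes=0))
    print(record.repository_name, len(record.tags))
```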

4     DockerPedia Data Publishing
4.1    The Docker Ontology
To publish all the data described in the previous section as a knowledge
graph and link it to the security vulnerability databases and the Debian
package RDF store, we used Morph [5], a Relational Database to RDF
(RDB2RDF) [3] engine. Since the RDB2RDF process needs an ontology, we
developed a lightweight ontology to shape the relations between the different
concepts we want to publish. The ontology imports classes and relations from
the Docker Ontology [4] for some of the Docker concepts and from the WICUS
ontology [6] for the software experiment reproducibility concepts. The main
classes in the ontology are: SoftwareImage (the image from which a container
is deployed), ImageLayer (a line within the Dockerfile that installs software
within the container), SoftwarePackage (a package installed at the
ImageLayer), PackageVersion (the version of each package),
SoftwareVulnerability (the vulnerabilities of each package) and the
DockerFile class. DockerPedia resources are identified by the URI
http://dockerpedia.inf.utfsm.cl/resource, while the vocabulary is identified
by the URI http://dockerpedia.inf.utfsm.cl/ontology. The resulting ontology
is shown in Figure 1.
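
To give a feel for the kind of triples such an ontology produces, the sketch below emits a few N-Triples lines by hand in Python. It is not the Morph/R2RML pipeline used for DockerPedia. The class names and the two base URIs come from the text, but the URI layout (`/image/...`, `#` as vocabulary separator) and the property names `hasLayer`, `installs` and `hasVersion` are assumptions made only for this illustration.

```python
from urllib.parse import quote

RESOURCE = "http://dockerpedia.inf.utfsm.cl/resource"
ONTOLOGY = "http://dockerpedia.inf.utfsm.cl/ontology"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def resource_iri(kind: str, identifier: str) -> str:
    # Assumed layout: <resource base>/<kind>/<percent-encoded identifier>.
    return f"<{RESOURCE}/{kind}/{quote(identifier, safe='')}>"


def term_iri(name: str) -> str:
    # Assumed separator '#' between the vocabulary URI and the term name.
    return f"<{ONTOLOGY}#{name}>"


def describe_image(image_id: str, layer_id: str, package: str, version: str) -> list[str]:
    """Return N-Triples lines linking image -> layer -> package -> version."""
    image = resource_iri("image", image_id)
    layer = resource_iri("layer", layer_id)
    pkg = resource_iri("package", package)
    ver = resource_iri("version", f"{package}-{version}")
    triples = [
        (image, RDF_TYPE, term_iri("SoftwareImage")),
        (layer, RDF_TYPE, term_iri("ImageLayer")),
        (pkg, RDF_TYPE, term_iri("SoftwarePackage")),
        (ver, RDF_TYPE, term_iri("PackageVersion")),
        (image, term_iri("hasLayer"), layer),    # hypothetical property
        (layer, term_iri("installs"), pkg),      # hypothetical property
        (pkg, term_iri("hasVersion"), ver),      # hypothetical property
    ]
    return [f"{s} {p} {o} ." for s, p, o in triples]


if __name__ == "__main__":
    # Layer id, package and version are placeholders.
    for line in describe_image("google/cadvisor:latest", "layer-0", "curl", "1.0"):
        print(line)
```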


                              Figure 1. Docker ontology.


2
    Clair is an open-source tool from CoreOS designed to identify known vulnerabilities
    in Docker images. More information at https://github.com/coreos/clair
4.2   The DockerPedia Resource
The DockerPedia dataset contains information about 4,500,000 Docker images
stored in the Docker Hub portal. These Docker images contain several layers
that install and remove software packages from the filesystem until the image
is finally deployed as a Docker container. We analyzed all the packages of all
the layers within the 28% most downloaded Docker images, obtaining a total of
102,993,552 triples and 13,136 links to the Debian RDF package dataset.

Use cases: The goal of this resource is to allow users to understand what is
inside the Docker images they run as containers on their hosts, which also
makes it easy to reproduce the environment in which a software package was
deployed. Examples of these use cases can be found at
https://dockerpedia.inf.utfsm.cl/examples.
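
As a hedged illustration of such a use case, the Python sketch below builds a SPARQL query string that would list the packages and versions reachable from one image. No endpoint is contacted, since none is named here. The property names `hasLayer`, `installs` and `hasVersion` and the `#` vocabulary separator are assumptions, so the query would need to be adapted to the published ontology before use.

```python
ONTOLOGY = "http://dockerpedia.inf.utfsm.cl/ontology"

# Characters that may not appear inside a SPARQL IRI reference.
FORBIDDEN = set('<>"{}|^`\\')


def packages_query(image_iri: str) -> str:
    """Build a SPARQL SELECT query listing packages and versions of an image."""
    if any(ch in FORBIDDEN or ord(ch) <= 0x20 for ch in image_iri):
        raise ValueError("image_iri contains characters not allowed in an IRI")
    return (
        f"PREFIX dp: <{ONTOLOGY}#>\n"
        "SELECT ?package ?version WHERE {\n"
        f"  <{image_iri}> dp:hasLayer ?layer .\n"   # hypothetical property
        "  ?layer dp:installs ?package .\n"         # hypothetical property
        "  ?package dp:hasVersion ?version .\n"     # hypothetical property
        "}"
    )


if __name__ == "__main__":
    # Hypothetical resource IRI under the DockerPedia resource namespace.
    print(packages_query("http://dockerpedia.inf.utfsm.cl/resource/image/example"))
```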

5     Conclusions
Throughout this paper, we have presented a resource that allows users to analyze
Docker images before they run them on their hosts. We have gathered all
Docker images stored at Docker Hub and analyzed the 28% most downloaded
of them. The analysis resulted in a dataset with more than 100 million triples,
storing data about Docker images, software packages and their vulnerabilities,
as well as links to the Debian RDF package dataset and to the vulnerability
information pages. The data gathered in DockerPedia allows users to explore the
packages and vulnerabilities of a Docker image without having to deploy the
container, making it possible to reproduce the software execution environment
even if the image itself is not available.

Acknowledgments:      The authors are supported by the Fondecyt Project
11170714.

References
1. T. Berners-Lee. Is your linked open data 5 star? Retrieved from
   https://www.w3.org/DesignIssues/LinkedData.html, 2010.
2. C. Boettiger. An introduction to Docker for reproducible research. SIGOPS Oper.
   Syst. Rev., 49(1):71–79, Jan. 2015.
3. S. Das, S. Sundara, and R. Cyganiak. R2RML: RDB to RDF mapping language. W3C
   RDB2RDF Working Group, 2012.
4. D. Huo, J. Nabrzyski, and C. Vardeman. Smart container: an ontology towards
   conceptualizing Docker. In International Semantic Web Conference (Posters &
   Demos), 2015.
5. F. Priyatna, O. Corcho, and J. Sequeda. Formalisation and experiences of
   R2RML-based SPARQL to SQL query translation using morph. In Proceedings of
   the 23rd International Conference on World Wide Web, pages 479–490. ACM, 2014.
6. I. Santana-Perez, R. F. da Silva, M. Rynge, E. Deelman, M. S. Pérez-Hernández,
   and O. Corcho. Reproducibility of execution environments in computational science
   using semantics and clouds. Future Generation Computer Systems, 67:354–367,
   2017.