Containerized A/B Testing

ÁDÁM RÉVÉSZ and NORBERT PATAKI, Eötvös Loránd University, Faculty of Informatics

Software version ranking plays an important role in improving user experience and software quality. A/B testing is a technique to distinguish between the popularity and usability of two quite similar versions (A and B) of a product, marketing strategy, search ad, etc. It is a kind of two-sample hypothesis testing, as used in the field of statistics. This controlled experiment can evaluate user engagement or satisfaction with a new service, feature, or product. In software technology, A/B testing is typically used in the evaluation of user-experience design. DevOps is an emerging software methodology in which development and operations are not independent processes; they affect each other. DevOps emphasizes the usage of virtualization technologies (e.g. containers). Docker is a widely used technology for containerization. In this paper we deal with a new approach for A/B testing via Docker containers. This approach is DevOps-style A/B testing because, after the evaluation, the better version remains in production.

Categories and Subject Descriptors: K.6.3 [Management of Computing and Information Systems]: Software Management—Software selection; H.5.2 [Information Interfaces and Presentation]: User Interfaces—Evaluation/methodology; D.2.9 [Software Engineering]: Management—Software Management

General Terms: Software Quality Analysis with Monitoring

Additional Key Words and Phrases: Docker, containers, DevOps, A/B testing

This work is supported by EFOP-3.6.2-16-2017-00013.
Authors' address: Á. Révész, N. Pataki, Eötvös Loránd University, Faculty of Informatics, Department of Programming Languages and Compilers, Pázmány Péter sétány 1/C, Budapest, Hungary, H-1117; email: reveszadam@gmail.com, patakino@elte.hu
Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: Z. Budimac (ed.): Proceedings of the SQAMIA 2017: 6th Workshop of Software Quality, Analysis, Monitoring, Improvement, and Applications, Belgrade, Serbia, 11-13.9.2017. Also published online by CEUR Workshop Proceedings (http://ceur-ws.org, ISSN 1613-0073).

1. INTRODUCTION

Nowadays A/B testing plays an important role in the evaluation of different but very similar user-experience designs. It is widely used among online websites, including social network sites such as Facebook, LinkedIn, and Twitter, to make data-driven decisions [Xu et al. 2015]. A/B testing is an important method regarding how social media affects software engineering [Storey et al. 2010]. A/B testing has been applied successfully in many web portals [Kohavi et al. 2009].

A/B testing is a powerful method because it is based on the behavior of the end-users. A/B testing of webpages or web applications requires small changes in the user-experience design (e.g. colors, structure of a page, shape of buttons, etc.). As visitors are served either the control or the variation, their engagement with each experience is measured and collected. From the collected information it can be determined whether changing the experience had a positive, negative, or no effect on visitor behavior. A/B testing of these applications takes time to collect a proper amount of feedback.

Containerization is a new direction in virtualization: this lightweight approach supports operating system-based isolation among different pieces of the application [Soltesz et al. 2007]. Containerization has been on the crest of a wave since Docker was developed. Docker provides a systematic way to automate the faster deployment of Linux applications inside portable containers. Basically, Docker extends casual Linux containers (LXC) with a kernel- and application-level API for improved isolation [Bernstein 2014]. Docker is an emerging tool for starting a complex application in many virtual, operating-system-separated parts, and it ensures the communication among them. Docker has comprehensive documentation [Docker Inc. 2017].

Docker containers are built up from base images; there are general images (e.g. Ubuntu 16.04) and specific images (e.g. for a Python runtime environment). Dockerfiles describe how an image can be created, and Docker is able to generate the image and save it to the repository. Many services of the Docker platform are available (e.g. Docker Engine, Docker Compose, etc.). The images are stored in Docker Registries. The Docker Engine is responsible for managing containers (starting, stopping, etc.), while Docker Compose is responsible for the configuration of containers on a single host system. Docker Compose is mainly used in development and testing environments. One can define which services are required and what their configurations are; Docker Compose files can be created for this purpose. Orchestration of Docker containers is typically executed with Kubernetes or OpenShift [Vohra 2017].

Continuous Delivery (CD) is a software development discipline. This methodology aims at building software in such a way that the software can be released to production at any time. It is a series of processes that aims at safe and rapid deployment to production. Every change is delivered to a production-like environment called the staging environment. Rigorous automated testing ensures that the business applications and services work as expected. Since every change has been tested in staging, the application can be deployed to production safely.

The DevOps approach extends the CD discipline and focuses on comprehensive CD pipelines: it starts with building, followed by different kinds of testing [Schaefer et al. 2013]. Unit testing, component testing, integration testing, end-to-end testing, performance testing, etc. should be performed on the software [Roche 2013]. In the meantime, static analyser tools try to find bugs, code smells and memory leaks in the source code. Third-party compliance should be checked in the build pipeline. Automated vulnerability scanning of the software is mandatory to discover security gaps. The visibility of the whole process is guaranteed. After this phase, the automatic deployment of the application starts. Application Release Automation (ARA) tools are available that can communicate with the CI server, and the deployment steps can be designed on the graphical user interface of these tools. The DevOps culture argues for deployment automation at the level of the application [Cukier 2013]. The automatic upgrade and roll-back processes involve many difficult changes. Database schemas, configuration files and parameters, APIs, and third-party components (e.g. message queues) may change when a new software version is released. The deployment process has to cover these changes as well, and it requires automation and visibility.

DevOps also considers the monitoring and logging of the deployed application in the production environment [Lwakatare et al. 2015]. The development team is eager for feedback from the application running in the production environment:
e.g. which features of the software go unused, memory or other resource leak detection, or performance bottlenecks. The ELK stack is a popular toolset for this purpose [Lahmadi and Beck 2015]: Elasticsearch is a distributed search and analytics engine, Logstash is a data processing pipeline, and Kibana is responsible for the visualization. Docker supports some logging drivers out of the box, such as the JSON log driver and the GELF log driver, to handle the log streams of each container. With the GELF log driver the container logs can be forwarded to an ELK stack. The Graylog Extended Log Format (GELF) is understood by most log aggregating systems, such as Logstash or, more obviously, Graylog. Developers have to get as much information as possible to be able to take care of a problem [Prakash et al. 2016]. Problems may cause an automatic roll-back of the application to the previous stable version in a seamless way. The analysis of logs and monitoring data is application-specific and their evaluation may be difficult; therefore, big data analysis and machine learning shall be involved.

In this paper we argue for a new DevOps-style A/B testing method with an automated, user experience-based approach. We take advantage of logging and monitoring features to get feedback from the end-users. Our approach works in a Docker containerized realm, thus the web applications and every tool used in the evaluation run in containers. After the specified duration the A/B test is evaluated and the winner version of the web application remains in the production environment automatically.

This paper is organized as follows: in Section 2 we present our A/B testing approach at a high level, and we go into implementation details per component in Section 3. Finally, the paper concludes and presents future work in Section 4.

2. OUR APPROACH

We propose an approach for A/B testing of web applications in a Docker containerized way. The approach takes advantage of Docker, the Nginx server, the ELK stack and Graylog. We have developed a script for controlling the A/B testing; this script is written in Python.

The two variants of the same web application are running in separate sets of containers. The Nginx server is also running in a container. Nginx routes the users to the A or B version based on their IP hash. On the client side of both web applications, HTTP requests are submitted to the Nginx server. Two kinds of requests are in use. The first one is periodic and states whether the user is still using the application. The second one is triggered by the end-user and indicates the user's activity. Both requests contain the application version of origin as a tag. We collect the logs of the web applications in an ELK stack.

The Python script runs on the host machine. The script takes a duration parameter that specifies how long the A/B test runs. When this duration expires, the script gets log information from the ELK-stack container and evaluates which version is the better one. The script instructs Docker to discontinue running the worse version and replaces it with the better one.

3. TECHNICAL DETAILS

3.1 Client side

For our research we created two versions of a simple website with different titles and headlines, clearly indicating which version we are looking at in our web browser. Both versions have a link. The page also contains a JavaScript script which acts like a subset of any other webpage analytics bundle.
It generates a UUID on every page load and sends an HTTP GET request to the '/ping' route every 5 seconds, passing the generated client id as a parameter. From these messages we can derive a metric which indicates how long the user stays on the page. We also send an HTTP GET request containing the client id to the '/click' endpoint on the click event of the link. We do not create any relation between those UUIDs and session cookies, keeping the data anonymous, as a good analytics tool should anyway. We did not want to put any unnecessary (A or B) version-specific code into the web page (nor into the backend), because it could pollute the source code of the product itself and is irrelevant to the product's own concerns.

3.2 Backend side

The backend simply serves a static HTML file (which contains all of the client-side code) and responds with status 200 to every request at the routes '/click' and '/ping'. It dumps every request to the standard output. All of this has been configured in a single nginx.conf file to keep this proof-of-concept project simple (a hedged sketch of such a configuration is shown in Section 3.3, just before the compose file).

3.3 Docker containers

First of all, our testing stack has a load balancer container which listens on port 80 and forwards the requests to node1 or node2. The forwarding depends on the client IP hash, in order to make sure that the client's click and ping requests are forwarded to the very same node which served the HTML file earlier (so the load balancer will not switch the version between two requests from the same client). The load balancer references Node1 and Node2 by their aliases.

Docker Engine has a solution to create virtual networks between containers, so when multiple products have containers up and running on the same host machine, they do not interfere with each other, since they connect to separate virtual networks. Docker Compose takes care of creating a network by default for our project, which is defined in a docker-compose.yaml file (see below). This default network is created with the name of the containing folder (assuming it is the same as the project name) and a default prefix. This came in handy when we created a new container and connected it to the same network by hand at the end of the test evaluation. The Docker Engine takes care of DNS services on the virtual network, which is why we can reference containers by their names. We do not need to change configurations on every startup and we do not have to save IP addresses in environment variables or hosts files on containers. It is more dynamic and more secure.

The nginx.conf of the load balancer is the following:

error_log /dev/stdout info;

events {}

http {
    access_log /dev/stdout;

    upstream abtest {
        ip_hash;
        server node1;
        server node2;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://abtest;
        }
    }
}

Node1 only differs from Node2 in its index.html file and, more importantly, in its tag: Node1 has the "version-a" tag while Node2 has the "version-b" tag at the beginning. The version tag is also sent in every log message to the Graylog server, providing the identity of the version. As shown in the compose file below, Node1 and Node2 have no open ports; they can receive requests only through the load balancer. As we mentioned earlier, the backend prints all of its requests to the standard output. The standard output is forwarded in GELF format to the GELF server.
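The per-node nginx.conf itself is not reproduced in this paper. As a purely illustrative sketch, a configuration of the kind described in Section 3.2 might look like the following; the location blocks, the return directives and the root path are assumptions (the root does match the volume mounts in the compose file below), not an excerpt from our project:

error_log /dev/stdout info;

events {}

http {
    # every request, including its query string (the client id), is dumped to standard output
    access_log /dev/stdout;

    server {
        listen 80;

        # serve the static index.html of version A or B mounted into this path
        location / {
            root /usr/share/nginx/html;
            index index.html;
        }

        # analytics endpoints: an empty 200 response is sufficient,
        # the access log entry itself carries the information we need
        location /ping {
            return 200;
        }

        location /click {
            return 200;
        }
    }
}

The docker-compose.yaml file that wires the load balancer and the two nodes together is the following: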
version: '3'
services:
  loadBalancer:
    image: nginx
    ports:
      - "80:80"
    volumes:
      - ./etc/nginx.conf:/etc/nginx/nginx.conf:ro
  node1:
    image: nginx
    logging:
      driver: gelf
      options:
        gelf-address: "udp://127.0.0.1:12201"
        tag: "version-a"
    volumes:
      - ./nodes/static/versionA:/usr/share/nginx/html:ro
      - ./nodes/etc/nginx.conf:/etc/nginx/nginx.conf
  node2:
    image: nginx
    logging:
      driver: gelf
      options:
        gelf-address: "udp://127.0.0.1:12201"
        tag: "version-b"
    volumes:
      - ./nodes/static/versionB:/usr/share/nginx/html:ro
      - ./nodes/etc/nginx.conf:/etc/nginx/nginx.conf

3.4 Log aggregation

There are numerous ELK stack configurations available on the Docker community hub, so we omit the details for now. We have a Graylog server up and running which receives the logs of Node1 and Node2. We have set up an extractor which checks the message property of the log and uses a regular expression to extract 'click' or 'ping' from the request route into a separate field called clientLogEvent when it is present; another extractor works in the same way and extracts clientSessionId. Creating extractors and testing queries on the Graylog web interface is comfortable and can be done without digging into Elasticsearch querying. It is suitable for anyone who wants to shape it to fit their own specific A/B test scenario.

3.5 Evaluation and replacement

We have decided that this task has to be done on the host machine by a script which can interact with the Docker Engine (or Swarm, Kubernetes, etc.). For security reasons we cannot (nor do we want to) give a container system-level access to other containers. We have chosen Python as the most suitable scripting language for this task: Python is mature, most *nix boxes have a Python environment pre-installed, and Docker has a solid Python SDK, actively used by the Docker Compose project.

In our example we have decided to measure the count-of-clicks metric ('/click' route requests): the bigger, the better. When the duration of the test is exceeded, the script sends one query per version to the Graylog server API to count its clicks (clientLogEvent: click). We use the Apache Lucene syntax for queries. The script compares the results and then, with the power of the Docker SDK, shuts down the node of the loser version and replaces it with an instance of the winner version.

The Python script itself interacts with the Graylog Web API and gets a session token by sending login credentials. At this point we could use API tokens set up on the Graylog web interface, but we did not want to increase the complexity of the configuration for this example. The query is sent to the Graylog REST API, but it is just like any other REST API call, so we omit the details for now (a hedged sketch of such a query is given below, just before the replacement code).

The interesting part is how we replace the container running the worse version with a new container running the better one. We stop and remove the "loser" container first to avoid naming conflicts later on. After that we create a new container with the same parameters as the "winner" container, but with the name of the loser one. We connect the new container to the project's network using the same alias as the removed container had. When we start up the new container, the load balancing works the same as before and the new node can be reached by the same name its predecessor could be reached by.
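As a minimal sketch of the click-count query mentioned above, the following Python snippet shows one way such a request could look. It is hypothetical rather than an excerpt from our script: the endpoint path (/search/universal/relative), the total_results response field and the use of HTTP Basic authentication are assumptions about the Graylog REST API, and filtering by a tag field assumes that the GELF tag is stored under that name.

import requests

def count_clicks(api_address, api_user, api_pass, version_tag, range_seconds):
    # Lucene query: count only '/click' events originating from the given version.
    query = 'clientLogEvent:click AND tag:{}'.format(version_tag)
    response = requests.get(
        '{}/search/universal/relative'.format(api_address),  # assumed Graylog search endpoint
        params={'query': query, 'range': range_seconds},
        auth=(api_user, api_pass),  # a session or API token could be used instead
        headers={'Accept': 'application/json'},
    )
    response.raise_for_status()
    return response.json()['total_results']  # assumed field holding the hit count

# The version with the higher click count wins, e.g. for a 30-minute test:
# winnerTag = aTag if count_clicks(api, user, pw, aTag, 1800) >= count_clicks(api, user, pw, bTag, 1800) else bTag

The replacement of the losing container, as described above, is then carried out with the Docker SDK: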
# stop and remove the container running the losing version
loser = client.containers.get(loserContainerName)
loser.stop()
loser.remove()

# create a new container for the winning version, reusing its volumes and tag,
# but under the name of the removed loser container
newNode = client.containers.create(
    'nginx',
    name=loserContainerName,
    volumes_from=[winnerContainerName],
    log_config={
        'driver': 'gelf',
        'options': {
            'gelf-address': 'udp://127.0.0.1:12201',
            'tag': winnerTag
        }
    }
)

# detach the new container from the default bridge network and attach it
# to the project's network under the old alias
bridgeNetwork = client.networks.get('bridge')
bridgeNetwork.disconnect(loserContainerName)
testNetwork = client.networks.get(self.networkName)
testNetwork.connect(loserContainerName, aliases=[loserContainerName])

newNode.start()

The script is just a proof of concept, but we have created a command line interface for it, with some parameters, so that we can test it on different setups. Its help text tells us what parameters we can use for our tests.

$ abtestCli -h
usage: abtestCli.py [-h] [--duration DURATION] [--aTag ATAG] [--bTag BTAG]
                    [--networkName NETWORKNAME] [--apiAddress APIADDRESS]
                    [--apiUser APIUSER] [--apiPass APIPASS]
                    aName bName

A CLI tool for runtest

positional arguments:
  aName
  bName

optional arguments:
  -h, --help            show this help message and exit
  --duration DURATION
  --aTag ATAG
  --bTag BTAG
  --networkName NETWORKNAME
  --apiAddress APIADDRESS
  --apiUser APIUSER
  --apiPass APIPASS

3.6 Running

Assume that we have the docker-compose.yaml file in our current working directory.

$ docker-compose up -d

After it has started our services, we only have to start our Python CLI script. It has three mandatory parameters:

(1) the duration, in ISO 8601 duration format;
(2) the A version container name;
(3) the B version container name.

$ abtestCLI.py PT30M ab_node1_1 ab_node2_1

After thirty minutes the script will log the name of the better version and replace the worse one with it.

4. CONCLUSION

A/B testing is a powerful method to improve software quality and user experience. It gains feedback from two akin versions of the same product (software, search ad, newsletter email, etc.) and measures end-user engagement. We have developed an approach and related tools for executing A/B testing in a Docker containerized environment.

Our proof-of-concept implementation is working and has fulfilled our expectations, but there is a lot of work to do and numerous choices to make before it becomes production-ready. One of our goals was to keep the stack and the implementation simple to aid the understanding of the concept.

We have mentioned that Docker Compose is intended for single-host development and testing, and it did a great job providing an initial state for our services. However, we have also met its limitations, such as dynamic configuration. Assume that we use the same stack, the A/B test is over, there are winner version containers everywhere, and then our system shuts down. Since Docker Compose cannot persist configuration changes to its compose file, our configuration will be restored to the original one on the next docker-compose up command. There are great configuration management software tools like Puppet or Chef [Spinellis 2012]. Of course, when it comes down to scalability, we have to use Docker Swarm or Kubernetes client libraries, etc. for managing version replacement on a multi-host system.

The concept is proven and we are excited to make it work at enterprise level. There could be a great A/B test deployment service on Amazon AWS or Microsoft Azure. Those companies have the resources and technology to create a powerful analytics system with an integrated automatic deployment solution.
REFERENCES

David Bernstein. 2014. Containers and Cloud: From LXC to Docker to Kubernetes. IEEE Cloud Computing 1, 3 (Sept. 2014), 81–84. DOI:http://dx.doi.org/10.1109/MCC.2014.51
Daniel Cukier. 2013. DevOps Patterns to Scale Web Applications Using Cloud Services. In Proceedings of the 2013 Companion Publication for Conference on Systems, Programming, & Applications: Software for Humanity (SPLASH '13). ACM, New York, NY, USA, 143–152. DOI:http://dx.doi.org/10.1145/2508075.2508432
Docker Inc. 2017. Docker Documentation. https://docs.docker.com/. (2017).
Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M. Henne. 2009. Controlled experiments on the web: survey and practical guide. Data Mining and Knowledge Discovery 18, 1 (2009), 140–181. DOI:http://dx.doi.org/10.1007/s10618-008-0114-1
Abdelkader Lahmadi and Frédéric Beck. 2015. Powering Monitoring Analytics with ELK stack. In 9th International Conference on Autonomous Infrastructure, Management and Security (AIMS 2015) (June 2015). https://hal.inria.fr/hal-01212015
Lucy Ellen Lwakatare, Pasi Kuvaja, and Markku Oivo. 2015. Dimensions of DevOps. In Agile Processes in Software Engineering and Extreme Programming: 16th International Conference, XP 2015, Helsinki, Finland, May 25-29, 2015, Proceedings, Casper Lassenius, Torgeir Dingsøyr, and Maria Paasivaara (Eds.). Springer International Publishing, Cham, 212–217. DOI:http://dx.doi.org/10.1007/978-3-319-18612-2_19
Tarun Prakash, Misha Kakkar, and Kritika Patel. 2016. Geo-identification of web users through logs using ELK stack. In 2016 6th International Conference - Cloud System and Big Data Engineering (Confluence). 606–610. DOI:http://dx.doi.org/10.1109/CONFLUENCE.2016.7508191
James Roche. 2013. Adopting DevOps Practices in Quality Assurance. Commun. ACM 56, 11 (Nov. 2013), 38–43. DOI:http://dx.doi.org/10.1145/2524713.2524721
Andreas Schaefer, Marc Reichenbach, and Dietmar Fey. 2013. Continuous Integration and Automation for DevOps. In IAENG Transactions on Engineering Technologies: Special Edition of the World Congress on Engineering and Computer Science 2011, Kon Haeng Kim, Sio-Iong Ao, and B. Burghard Rieger (Eds.). Springer Netherlands, Dordrecht, 345–358. DOI:http://dx.doi.org/10.1007/978-94-007-4786-9_28
Stephen Soltesz, Herbert Pötzl, Marc E. Fiuczynski, Andy Bavier, and Larry Peterson. 2007. Container-based Operating System Virtualization: A Scalable, High-performance Alternative to Hypervisors. SIGOPS Oper. Syst. Rev. 41, 3 (March 2007), 275–287. DOI:http://dx.doi.org/10.1145/1272998.1273025
Diomidis Spinellis. 2012. Don't Install Software by Hand. IEEE Software 29, 4 (July 2012), 86–87. DOI:http://dx.doi.org/10.1109/MS.2012.85
Margaret-Anne Storey, Christoph Treude, Arie van Deursen, and Li-Te Cheng. 2010. The Impact of Social Media on Software Engineering Practices and Tools. In Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research (FoSER '10). ACM, New York, NY, USA, 359–364. DOI:http://dx.doi.org/10.1145/1882362.1882435
Deepak Vohra. 2017. Using an HA Master with OpenShift. Apress, Berkeley, CA, 335–353. DOI:http://dx.doi.org/10.1007/978-1-4842-2598-1_15
Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, and Anmol Bhasin. 2015. From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15). ACM, New York, NY, USA, 2227–2236. DOI:http://dx.doi.org/10.1145/2783258.2788602