7  Infrastructure

Yes, there is a third area besides your research and carpentry-level programming that you should get at least an idea of. Again, you do not have to master hosting servers or even clusters, but a decent overview and a sense of when to use what will help you tremendously to plan ahead.

7.1 Why Go Beyond a Local Notebook?

Admittedly, unless you have always had a knack for Arduinos, Raspberry Pis or the latest beta version of the software you use, infrastructure may be the one area you perceive as overhead. So, why leave the peaceful, well-known shire of our local environment for the uncharted, rocky territory of unwelcoming technical documentation and time-consuming rabbit holes?

Performance, particularly in terms of throughput, is one good reason to look beyond desktop computers. Data protection regulations that prohibit data downloads may simply force us to not work locally. Or we just do not want a crashing office application to bring down a computation that ran for hours or even days. Or we need a computer that is online 24/7 to publish a report, website or data visualization.

Figure: Thinking of a server as a ‘program that listens’ rather than a data vault that looks like a fridge is a good starting point. (Source: own illustration)

7.2 Hosting Options

So, where should we go with our project when it outgrows the local environment of our notebooks? This has actually become a tough question because of all the reasonable options out there. Technical advances, almost unparalleled scalability and large profits for the biggest players have led modern infrastructure providers to offer an incredible variety of products to choose from. Obviously, a description of product offerings in a rapidly evolving field is not well suited for discussion in a book. Hence, Research Software Engineering intends to give an overview that classifies the general options and common business models.

7.2.1 Software-as-a-Service (SaaS)

The simplest solution and fastest time-to-market is almost always to go for a SaaS product – particularly if you are not experienced in hosting and just want to get started without thinking about which Linux distribution to use and how to maintain your server. SaaS products abstract all of that away from the user and focus on doing one single thing well. The shinyapps.io platform is a great example of such a service: users can sign up and deploy their web applications within minutes. It is a particularly interesting example of a SaaS product because R developers who come from field-specific backgrounds other than programming are often not familiar with web development and hosting websites. Some of these developers, for whom R might be their first programming language, are suddenly empowered to develop and run online content thanks to the R shiny web application framework, which uses R to generate HTML, CSS and JavaScript based applications. Still, those applications need to be hosted somewhere. This is precisely what shinyapps.io does: the service solely hosts web applications that were created with the shiny web application framework. This ease of use through specialization is also the biggest limitation of SaaS products. A website generated with another tool cannot be deployed easily. In addition, the mere bang-for-buck price is rather high compared to self-hosting, as users pay for a highly standardized, managed hosting product. Nevertheless, because of the low volume of most projects, SaaS is feasible for many of them, especially at a proof-of-concept stage.
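To give an idea of how little ceremony is involved, the following sketch deploys a local shiny app to shinyapps.io from the command line via the rsconnect package; the account name, token, secret and the app/ directory are placeholders you would replace with your own values.

# a minimal sketch: deploy a local shiny app to shinyapps.io
# <ACCOUNT>, <TOKEN>, <SECRET> and app/ are placeholders
Rscript -e 'rsconnect::setAccountInfo(name = "<ACCOUNT>", token = "<TOKEN>", secret = "<SECRET>")'
Rscript -e 'rsconnect::deployApp("app/")'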

In case you are interested in getting started with shiny, take a look at the shiny case study in this book. The study explains basic concepts of shiny to the user and walks readers through the creation and deployment of a simple app.

SaaS, of course, is neither an R nor a data science idea. Modern providers offer databases, storage, calendars, face recognition and location services, among other things.

7.2.2 Self Hosted

The alternative approach to buying multiple managed services is to host your applications yourself. Since – and this applies to most users at least – you are not going to buy your own hardware, connect it to your home Wi-Fi and start your own hosting provider, we need to look at different degrees of self-hosting. Larger organizations, e.g., universities, often like to host applications on their own hardware, within their own network, to have full control of their data. Yet, self-hosting exposes you to issues, such as attacks, that you would not need to worry about as much in a software-as-a-service setting (as long as you trust the service).

Self-hosting allows you to host all the applications you want on a dedicated machine. Self-hosters can configure their server depending on their access rights. Offerings range from root access that allows users to do anything to different shades of managed hosting with more moderated access. Backed by virtual infrastructure, modern cloud providers offer a very dynamic form of self-hosting: their clients can use web GUIs and/or APIs to add, remove, reboot and shut down nodes. Users can spin up anything from pre-configured nodes optimized for different use cases to containerized environments and entire Kubernetes (K8s) clusters in the cloud. Flexible pricing models allow paying based on usage in a very dynamic fashion.
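To make this more tangible, here is a hedged sketch of what spinning up and removing a node can look like from the command line, using the AWS CLI purely as an example; the image id, key name and instance id are placeholders, and every major provider offers an equivalent CLI and web GUI.

# sketch (assumes an installed and configured AWS CLI)
# start a small virtual node; the ami id and key name are placeholders
aws ec2 run-instances --image-id ami-12345678 --instance-type t3.micro \
    --count 1 --key-name my-key
# list running nodes and shut one down again (placeholder instance id)
aws ec2 describe-instances
aws ec2 terminate-instances --instance-ids i-0abcd1234567890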

7.3 Building Blocks

Precisely because of the dynamics described above and the ubiquity of the cloud, it is good to know the building blocks of modern IT infrastructure.

7.3.1 Virtual Machines

Virtual machines (VMs) remain the go-to building blocks for many setups. Hence, university IT, private sector IT, independent hosting providers and online giants all offer VMs. Virtual machines allow running a virtual computer with its own operating system on some host machine. Running applications on such a virtual computer feels like running them on a standalone computer dedicated to that purpose.

Oracle’s VirtualBox is a great tool for trying virtual machines locally. VirtualBox allows you to run a virtual Windows or Linux inside macOS and vice versa. Running a virtual machine locally may not be the most performant solution, but it allows you to have several test environments without altering your main environment.
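Besides its GUI, VirtualBox ships with the VBoxManage command line interface. The following sketch shows roughly how a virtual machine can be created, configured and started without a window; the VM name and resource settings are arbitrary examples.

# sketch using VirtualBox's command line interface
# create and register a new VM (name and OS type are examples)
VBoxManage createvm --name "test-ubuntu" --ostype Ubuntu_64 --register
VBoxManage modifyvm "test-ubuntu" --memory 2048 --cpus 2
# start the VM without opening a GUI window
VBoxManage startvm "test-ubuntu" --type headless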

7.3.2 Containers & Images

At first glance, containers look very much like virtual machines to the practitioner. The difference is that every virtual machine has its own operating system, while containers use the host OS and run a container engine on top of it. This makes containers very lightweight; they may take only a few seconds to spin up, while spinning up virtual machines can take up to a few minutes – just like booting physical computers. Hence, docker containers are often used as single-purpose environments: fire up a container, run a task in that environment, store the results outside the container and shut the container down again.

Docker (Why docker?) is the most popular containerization solution and quickly became synonymous with container environments configured in a file. So-called docker images are built layer by layer on top of other, less specific docker images. A Dockerfile is the recipe for a new image. Images are blueprints for containers; a container is a running instance of an image. A docker runtime environment can build images from Dockerfiles and distribute these images to an image registry. The dockerhub platform hosts a plethora of pre-built docker images, from ready-to-go databases to Python ML environments or minimal Linux containers to run a simple shell script in a lab-type environment.
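The typical round trip from Dockerfile to registry looks roughly as follows; the repository name my-user/my-image is a placeholder for a repository you own on dockerhub.

# build an image from the Dockerfile in the current directory
docker build -t my-user/my-image:latest .
# push the image to a registry, here dockerhub
docker push my-user/my-image:latest
# others (or your server) can then pull and run it
docker pull my-user/my-image:latest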

Containers run in a docker runtime environment and can either be used interactively or in batches that execute a single task in an environment specifically built for this task. One of the reasons why docker is attractive to researchers is its open character: Dockerfiles are a good way to share a configuration in a simple, reproducible file, making it easy to reproduce setups. Less experienced researchers can benefit from dockerhub, which shares images for a plethora of purposes, from mixed data science setups to database configurations. Side-effect-free working environments for all sorts of tasks can be especially appealing in exotic and/or dependency-heavy cases.

Besides simplifying system administration, docker is known for its ability to work in the cloud. All major cloud hosters offer docker environments and the ability to deploy docker containers that were previously developed and tested locally. You can also use docker to tackle throughput problems with container orchestration tools like Docker Swarm or K8s (say: Kubernetes) to run hundreds of containers (depending on your virtual resources).
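As a minimal illustration of such orchestration, Docker Swarm can turn a single docker host into a one-node cluster and replicate a containerized task; the service and image names below are placeholders.

# sketch: initialize a one-node swarm and run a replicated service
docker swarm init
docker service create --name my-task --replicas 10 my-user/my-image:latest
# inspect how the replicas are doing
docker service ps my-task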

7.3.3 Kubernetes (K8s)

Though hosting Kubernetes is clearly beyond the scope of basic-level DevOps, the ubiquity of the term and technology, as well as the points of contact and similarities with the previously introduced concept of containers, justify a brief positioning of Kubernetes. We cannot really see Kubernetes as a building block like the technologies introduced above. K8s is a complete cluster with plenty of features to manage the system and its applications. Kubernetes is designed to run on multiple (virtual) nodes and to distribute processes, running in so-called pods, across those nodes.

Because Kubernetes is quite some overhead to manage, the big three and their alternatives all offer services such as a container registry to support their clients in running Kubernetes. Without any support and/or pre-configured parts, you would have to manage nodes in line with resource needs, keep the cluster running in high availability (HA) mode and stay on top of its many management features yourself. For a first local encounter, minikube provides a minimal, single-node Kubernetes environment, as sketched below.
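The following sketch uses minikube and kubectl to run a toy cluster locally; the deployment name and the nginx image are illustrative only.

# sketch: run a local, single-node Kubernetes with minikube
minikube start
kubectl get nodes
# create a deployment from an example image and scale it to 3 pods
kubectl create deployment hello --image=nginx
kubectl scale deployment hello --replicas=3
kubectl get pods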

7.4 Applied Containerization Basics

While the Building Blocks section above contextualizes the container approach, this section gives a simple 101 on the basics of containerization, enabling the reader to take their first steps in the container world. One of the beautiful things about containers is that, due to their isolated nature, one can go a long way trying things out, as containers get destroyed and recreated all the time. Also, because containers run in a standardized runtime environment, locally developed images easily transfer to large remote machines and clusters.

7.4.1 Dockerfiles

Dockerfiles are text file recipes to create images, i.e., blueprints for containers. One great thing about container images is that they are layered. That is, one can stack images and benefit from previously created images. The below example Dockerfile uses a standard, publicly available image from dockerhub.com and adds some custom packages.

# start from the rocker project's pre-built shiny server image
FROM rocker/shiny:latest
# add the Postgres client library at the operating system level
RUN apt-get update && \
    apt-get install -qq -y libpq-dev
# install R packages from CRAN using rocker's install2.r helper
RUN install2.r ggplot2 shiny shinydashboard \
               shinydashboardPlus \
               dplyr RPostgres

In this case, we make use of a pre-built image from the rocker project. The rocker project designs useful images around the R language ecosystem, builds them on a regular basis and makes them available via dockerhub. Here, our image allows running the open source version of shiny server in a docker container. We add a Postgres driver at the operating system level before we install several R packages from CRAN.
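Building and running containers is covered in more detail in the next section, but as a quick sketch, turning the above Dockerfile into a running shiny server could look like this; the tag my-shiny is arbitrary.

# build the image from the Dockerfile in the current directory
docker build -t my-shiny .
# run shiny server and map its port 3838 to the host
docker run -d -p 3838:3838 my-shiny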

7.4.2 Building & Running Containers

There are plenty of ways to build and run containers. Online tools, either offered as a service or self-hosted, can build images server-side. Yet, the easiest way to get started with containers is to build and run them locally with Docker Desktop.
Even though some may argue that Docker is not the best way to build containers, it is by far the best-known way and therefore comes with the largest ecosystem and the most community material.

Docker Desktop is an easy-to-use application available on Windows and macOS. With Docker Desktop, one can execute docker commands and build images, either in a GUI or using its CLI. The table below shows a few of the most basic docker commands.

command                                       description
docker run <image>                            run a container based on an image
docker ps                                     list running containers
docker images                                 list local docker images
docker pull <image>                           pull an image from a registry, e.g., dockerhub.com
docker kill <container>                       kill a running container
docker build <directory-with-docker-file>    build an image based on a Dockerfile
docker exec <container> <command>             execute a command inside a running container

To learn what a container looks like, e.g., to find out how the container was affected by changes to the Dockerfile, it can be very illustrative to walk around inside it. To do so with a container created from the above Dockerfile, start the container with a name and open a bash shell with the interactive flag -it.

# start a container in detached mode and give it an arbitrary name (here: shiny_test)
docker run -d --name shiny_test rocker/shiny
# open an interactive bash shell inside the running container
docker exec -it shiny_test /bin/bash

# alternatively, you could start R right away
# note that you need to know the location of the executable
docker exec -it shiny_test /usr/local/bin/R

7.4.3 Docker Compose - Manage Multiple Containers

Docker is a great way to give a new tool a spin without affecting one's proven environment. So, even if you are a container beginner, the time when you would like to spin up multiple containers at once will come quickly. While you can start as many containers as your local resources allow for, running several containers at once does not necessarily mean those containers are aware of each other, let alone able to talk to each other.
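Plain docker can already connect containers through a user-defined network on which containers reach each other by name, as the following sketch shows; the network and container names are arbitrary and the images are just examples. Docker compose, introduced below, makes this kind of setup declarative and repeatable.

# sketch: two containers on a shared user-defined network
docker network create demo_net
docker run -d --name db --network demo_net -e POSTGRES_PASSWORD=postgres postgres
docker run -d --name app --network demo_net rocker/shiny
# inside the app container, the database is now reachable under the host name "db"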

Modern applications often follow modular architecture patterns, i.e., they have a front end, some middle layer such as a REST API and a database backend. A web application may have a statically generated HTML front end that simply exposes HTML / CSS / JavaScript files and queries a REST API. The REST API may use the express framework and be served by a Node.js server which talks to a Postgres database backend. Each of these three parts could live in its own container. This is where docker can help to create a development environment locally that essentially mimics the production setup and therefore facilitates deployment to production.

Docker compose allows defining how multiple containers play together. Consider the following example file that creates two containers: a shiny web server and a database which can be queried by the shiny server.

services:
   postgres:
      # a fixed container name, e.g., db_container, allows the
      # shiny app to address this container as its database host
      container_name: db_container
      build: ./postgres
      restart: always
      environment:
         - POSTGRES_USER=postgres
         - POSTGRES_PASSWORD=postgres
      # This port mapping is only necessary to connect from the host, 
      # not to let containers talk to each other. 
      ports:
         - "5432:5432"
      volumes:
         - "./pgdata:/var/lib/postgresql/data"
   shiny: 
      container_name: shiny
      depends_on: 
         - postgres
      build: ./shiny
      volumes:
         - "./shiny-logs:/var/log/shiny-server"
         - "./shiny-home:/srv/shiny-server"
      ports:
         - "3838:3838"

Note how images are built from local directories postgres and shiny that contain Dockerfiles. It is also possible to pull images directly from a registry. To run such a system of multiple containers, simply use

docker compose up --force-recreate
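Once the containers are up, a few equally standard compose commands help to inspect and tear the setup down again:

# follow the logs of a single service, e.g., shiny
docker compose logs -f shiny
# stop and remove the containers (add -v to also remove volumes)
docker compose down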

Note that docker compose does not replace an orchestrator and cannot provide cluster functionality the way Docker Swarm or even Kubernetes can. It should not be used in a production setup unless you know what you are doing.