8  Automation

Make repetitive work fun again!

One aspect that makes a programming approach to data science and analytics appealing is automation. Data ingestion, dissemination, reporting or even analysis itself can be repetitive. Particularly shorter update cycles of one's data call for a way to make yet another iteration pain-free. The following sections look at different forms of automation, such as continuous integration and deployment, different forms of workflow automation, as well as infrastructure as code.

8.1 Continuous Integration / Continuous Deployment (CI/CD)

Because of its origin in build, test and check automation, CI/CD may not be the first thing that comes to mind when one approaches programming through the analytics route. Yet, thorough testing and automated builds have not only become well-established parts of the data science workflow; CI/CD tools can also help to automate tasks beyond testing and packaging your next release.

Modern version control platforms offer an easy way to add the tool chain that is often fuzzily called CI/CD (Continuous Integration / Continuous Deployment). While CI stands for continuous integration and refers to a workflow in which the team merges and tests its changes as continuously as possible, CD stands for either continuous delivery or continuous deployment.

Thanks to infrastructure as code and containerization, automation of development and deployment workflows has become much easier, not least because local development can run in an environment very close to the production setup. Git hosting powerhouses GitHub and GitLab run their own flavors of CI/CD tools, making the approach well documented and readily available by default to a wide array of users: [GitHub Actions](https://docs.github.com/en/actions) and GitLab CI. In addition, services like CircleCI offer this tool chain independently of hosting git repositories.

Users of the above platforms can upload a simple text file that follows a naming convention and structure to trigger a step-based tool chain when a particular event occurs. An example of such an event is a push to a repository's main branch. A common setup would be to run tests and/or build a package and, upon success, deploy the newly created package to some server – all triggered by a simple push to the main branch. One particularly cool thing is that several of these services run the tests on their servers using container technology. This leads to a great variety of setups for testing: software can easily be tested on different operating systems and environments.

Here is a simple sketch of a .gitlab-ci.yml configuration that builds and tests on pushes to all branches and, after a successful build and test stage, deploys a package on pushes to the main branch:

stages:
  - buildncheck
  - deploy_pack

test:
  image:
    name: some.docker.registry.com/some-image:0.2.0
    entrypoint:
      - ""
  stage: buildncheck
  artifacts:
    untracked: true
  script:
    - rm .gitlab-ci.yml # we don't need it and it causes a hidden file NOTE
    - install2.r --repos custom.mini.cran.ch .
    - R CMD build . --no-build-vignettes --no-manual
    - R CMD check --no-manual *.tar.gz

deploy_pack:
  only:
    - main
  stage: deploy_pack
  image:
    name: byrnedo/alpine-curl
    entrypoint: [""]
  dependencies:
    - 'test'
  script:
    - do some more steps to login and deploy to server ...

For more in-depth examples of the above, Jim Hester's talk on GitHub Actions for R is an excellent starting point. CI/CD tool chains are useful for a plethora of actions beyond mere building, testing and deployment of software. The publication chapter covers another common use case: rendering and deployment of static websites, i.e., websites that are updated by re-rendering their content at build time, creating static files (artifacts) that are uploaded to a web host. Outlets like the GitHub Actions Marketplace, the r-lib collection of R-specific actions and the wealth of readily available actions are a great showcase of the broad range of CI/CD applications.
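To give a flavor of the GitHub side, here is a minimal sketch of a GitHub Actions workflow, stored in .github/workflows/, that checks an R package on every push to the main branch using actions from the r-lib collection (action and runner versions are illustrative and may need updating):

on:
  push:
    branches: [main]

name: R-CMD-check

jobs:
  R-CMD-check:
    runs-on: ubuntu-latest
    steps:
      # check out the repository, set up R and the package dependencies, then run the check
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-r-dependencies@v2
        with:
          extra-packages: any::rcmdcheck
      - uses: r-lib/actions/check-r-package@v2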

8.2 Cron Jobs

Cron syntax is very common in the UNIX world and is also useful in the context of the CI/CD tools explained above: instead of being triggered on dispatch or by events, CI/CD processes can also run on a time-based schedule.
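GitHub Actions, for example, accepts cron expressions as workflow triggers; the following sketch (expression chosen purely for illustration) would trigger a workflow every Monday at 05:15 UTC:

on:
  schedule:
    - cron: "15 5 * * 1"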

Named after the UNIX job scheduler cron, a cron job is a task that runs periodically at fixed times. Because cron is pre-installed in most Linux server setups, data analysts often use cron jobs to regularly run batch jobs on remote servers. Cron jobs use a funny-looking syntax to determine when they run.

#min hour day month weekday
15 5,11 * * 1 Rscript run_me.R

The first position denotes the minute mark at which the job runs – in our case 15 minutes after the hour starts. The second position denotes the hours of the day – in our case the 5th and the 11th hour. The asterisks in the third and fourth positions are wildcards, meaning the job runs on every day of the month and in every month throughout the year. The last position denotes the weekday; in this example, we run our job solely on Mondays.

More advanced expressions can also handle running a job at much shorter intervals, e.g., every 5 minutes.

*/5 * * * * Rscript check_status.R
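
To register such a job on a server, edit the current user's crontab; the commands below assume a standard cron installation:

# open the current user's crontab in an editor and add a line like the above
crontab -e
# list the jobs that are currently registered for this user
crontab -l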

To learn and play with more expressions, check crontab guru. If you have more sophisticated use cases, like overseeing a larger number of jobs or execution on different nodes, consider using Apache Airflow as a workflow scheduler.

8.3 Workflow Scheduling: Apache Airflow DAGs

Though much less common than the above tools, Apache Airflow has definitely earned a mention because of its ability to help researchers keep an overview of regularly running processes. Examples of such processes could be daily or monthly data sourcing or the timely publication of a recurring indicator. I often refer to Airflow as cron jobs on steroids. Airflow ships with a dashboard to keep track of many timed processes, plus a ton of logging and reporting features that are worth a lot when maintaining recurring processes. Airflow has its own Airflow Summit conference, a solid community and a docker compose setup to get you started quickly. That setup consists of, among other services, a container for the web frontend and another container for the PostgreSQL backend. The fact that there is also a Managed Workflows for Apache Airflow offering in the Amazon cloud at the time of writing shows the broad acceptance of the tool. Airflow also runs on Kubernetes in case you are interested in hosting Airflow in a more robust production setup.
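As a rough sketch of that quick start (the exact download URL depends on the Airflow version; consult the official "Running Airflow in Docker" guide), getting the compose setup up and running boils down to a few commands:

# fetch the official docker-compose.yaml for your Airflow version
curl -LfO "https://airflow.apache.org/docs/apache-airflow/<version>/docker-compose.yaml"
# initialize the metadata database and create the default user
docker compose up airflow-init
# start the web server, scheduler and the remaining services
docker compose up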

So, what does Airflow look like in practice? Airflow uses the Python language to define Directed Acyclic Graphs (DAGs). Essentially, a DAG describes a process that has – unlike a cycle – a clear start and end. DAGs can be strictly sequential, but they can also run tasks in parallel.

from datetime import datetime, timedelta
from airflow import DAG
# Operators; we need this to operate!
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="bash_example",  # an illustrative name for this DAG
    default_args={
        "depends_on_past": False,
        "email": ["airflow-admin@your-site.com"],
        "email_on_failure": False,
        "email_on_retry": False,
        "retries": 1,
        "retry_delay": timedelta(minutes=5)
    },
    description="Bash Operator",
    schedule='5 11 * * *',
    start_date=datetime(2023, 1, 1),
    catchup=False,
    tags=["example"],
) as dag:
    # t1, t2 and t3 are examples of tasks created by instantiating operators
    t1 = BashOperator(
        task_id="print_date",
        bash_command="date > date.txt",
    )

    t2 = BashOperator(
        task_id="sleep",
        depends_on_past=False,
        bash_command="sleep 5",
        retries=3,
    )
    
    t3 = BashOperator(
        task_id="print_two_dates",  # task ids may only contain letters, numbers, dots, dashes and underscores
        depends_on_past=False,
        bash_command="cat date.txt && date",
    )

    t1 >> t2 >> t3

The above code shows an example of a simple DAG that combines three strictly sequential tasks: t2 depends on t1, and t3 in turn depends on t2. In this simple example, all tasks use the Airflow BashOperator to execute bash commands, but Airflow provides plenty of other operators, from the PythonOperator to Docker and Kubernetes operators. Such operators also allow executing tasks on remote machines or clusters, not solely on the machine or container that serves Airflow itself.
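
To illustrate, a task based on the PythonOperator could look like the following sketch; placed inside the with DAG(...) block above (and indented accordingly), it could be chained like the Bash tasks, e.g., t3 >> t4. The callable and its task_id are made up for this example:

from airflow.operators.python import PythonOperator

def _print_run_date(**context):
    # a made-up callable that simply prints the logical date of the run
    print(context["ds"])

t4 = PythonOperator(
    task_id="print_run_date",
    python_callable=_print_run_date,
)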

8.4 Make-Like Workflows

Makefiles and the make software that made them popular are a classic of build automation. Best known for its role in compiling C programs, make became popular across programming languages thanks to its straightforward concept and readable approach. A Makefile contains a target, an optional prerequisite and a recipe to get there.

target: prerequisite
    recipe

Like this:

say_yo: store_yo
    cat echo.txt

store_yo: 
    echo "yo!" > echo.txt

Running make on this file will simply print

echo "yo!" > echo.txt
cat echo.txt
yo!

in your terminal (make echoes each recipe line before executing it). Very much like Apache Airflow and its DAGs introduced above, Makefiles allow declaring and managing dependencies. The abilities of make and make-like approaches such as the targets R package go well beyond simple sequential definitions like the basic example above. Parallel execution of independent tasks, cherry-picking single tasks from a larger workflow, variables and caching are highly useful when runtime increases. Consider a multi-step workflow in which some long-running parts hardly ever change while other, swift parts change all the time, and it becomes obvious how such a simple CLI-based workflow helps.
Modern implementations like the targets R package ship with network visualizations of the execution path and its dependencies. The walkthrough provided in rOpenSci's book on targets is a great summary, available as text, code examples and video.
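
To illustrate, a minimal _targets.R pipeline could look like the sketch below (file path and filter condition are made up); targets::tar_make() then reruns only the targets that are outdated, while targets::tar_visnetwork() draws the dependency graph mentioned above:

# _targets.R: a minimal sketch of a {targets} pipeline
library(targets)
tar_option_set(packages = c("readr", "dplyr"))

list(
  # track the raw file itself, so downstream targets rerun when it changes
  tar_target(raw_file, "data/observations.csv", format = "file"),
  tar_target(raw_data, readr::read_csv(raw_file)),
  # a potentially long-running step that is cached until its inputs change
  tar_target(model_input, dplyr::filter(raw_data, !is.na(value)))
)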

8.5 Infrastructure as Code

In recent years, declarative approaches have helped to make task automation more inclusive and appealing to a wider audience. Not only the workflow itself, but also the environment a workflow should live and run in is often defined in a declarative fashion. This development does not only make maintenance easier, it also helps to make a setup reproducible and shareable.

Do not think infrastructure as code is only relevant for sys admins or other infrastructure professionals who make use of it every day. The ability to reproduce and tweak infrastructure at essentially no cost enables other forms of automation such as CI/CD, much like a flat rate with your mobile carrier leads to more calls.

Infrastructure as code approaches do not only describe infrastructure in a declarative and reproducible fashion, as stated above; infrastructure as code can also be seen as a way to automate setting up the environment you work with. The infrastructure section explains in greater detail what such definitions look like and why containers are the building blocks of modern infrastructure defined in code.
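
As a small taste before the infrastructure section goes into detail, the following docker-compose.yml sketch declares a development environment consisting of an RStudio container and a PostgreSQL database (image tags and passwords are placeholders):

services:
  rstudio:
    image: rocker/rstudio:4.3.2
    ports:
      - "8787:8787"            # RStudio Server listens on port 8787
    environment:
      - PASSWORD=change-me     # login password for the default rstudio user
  db:
    image: postgres:16
    environment:
      - POSTGRES_PASSWORD=change-me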

Automation is more complex for cluster setups because, among other things, applications need to be robust against pods being shut down on one node and spawned on another node, which allows hosting applications in high-availability mode. On Kubernetes clusters, Helm, Kubernetes' package manager, is part of the solution to tackle this added complexity. Helm is "the best way to find, share and use software built for Kubernetes". Terraform can manage Kubernetes clusters on clouds like Amazon's AWS, Google's GCP, Microsoft's Azure and many other cloud systems using declarative configuration files. Again, a console-based client interface is used to apply a declarative configuration plan to a particular cloud.
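
To sketch what such a declarative configuration file looks like, here is a hypothetical Terraform snippet that declares a single virtual machine on AWS (region, image id and instance type are placeholders); terraform plan previews the changes and terraform apply rolls them out:

terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

provider "aws" {
  region = "eu-central-1"                   # placeholder region
}

resource "aws_instance" "analytics_node" {
  ami           = "ami-0123456789abcdef0"   # placeholder machine image id
  instance_type = "t3.micro"                # placeholder machine size
}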