4  Interaction Environment

As a kid of the 90s, reading about a new editor or fancy shell coming out triggers playing Inspector Gadget’s tune in my head. To invest a few fun hours into exploring a new tool that might save me a split second or two in processes that I run a hundred times daily may not always pay off by the hour even in a lifetime of programming, but is totally worth it in most cases - at least to me. Repetitive work is tiring and error-prone, adds to cognitive workload, and shifts focus away from hard-to-tackle puzzles.

That being said, I am nowhere near the vim aficionados whose jaw-dropping editing speed probably takes 1000s of hours to master and configure, and would not advise overengineering setting up your environment. Like often, finding a personal middle ground between feeling totally uncomfortable outside graphical user interfaces and editing code at a 100 words per minute pace is key.

Wizard or Carpenter? Aiming at solid mastery of your toolset is sufficient to become happy with programming approach to analytics.

Most likely, investing in an initial setup of a solid editor plus some basic terminal workflows are a good starting point that can be revisited every once in a while, e.g., when we start a larger new project. What editor you will center your environment around will depend a lot on the programming language you chose. Yet, it also depends on your personal preference of how much investment, support and supportive features you want. The following features / criteria are likely the most influential when composing a programming environment for statistical computing:

4.1 Integrated Development Environments (IDE)

While some prefer lightning fast editors such as Sublime Text that are easy to customize but rather basic out-of-the-box, IDEs are the right choice for most people. In other words, it’s certainly possible to move a five-person family into a new home by public transport, but it is not convenient. The same holds for (plain) text editors in programming. You can use them, but most people would prefer an Integrated Development Environment (IDE) just like they prefer to use a truck when they move. IDEs are tailored to the needs and idiosyncrasies of a language, some working with plugins and covering multiple languages. Others have a specific focus on a single language or a group of languages.

The below sections will focus on data science’ most popular editors, namely Visual Studio Code and RStudio. Hence, I would like to mention at least some IDE juggernauts here, for the reader looking or alternatives: Eclipse (mostly Java but tons of plugins for other languages), or JetBrains’ IntelliJ (Java) and PyCharm (Python).

4.1.1 RStudio

Posit’s RStudio has become the default environment for most R people and those who mostly use R but C or Python and Markdown in addition. The open source version ships for free as RStudio Desktop and RStudio Server Open Source. In addition, the creators of RStudio offer commercially supported versions of both the desktop and server version (Posit Workbench). If you want your environment to essentially look like the environment of your peers, RStudio is a good choice. To have the same visual image in mind can be very helpful in workshops, coaching or teaching.

RStudio has 4 control panes which the user can fill with a different functionality like script editor, R console, terminal, file explorer, environment explorer, test suite, build suite and many others. I advise newcomers to change the default to have the script editor and the console next to each other (instead of below each other). That is because (at least in the Western world) we read from left to right and send source code from left to right to execute it in the console. Combine this practice with the run selection shortcut (cmd+enter or ctrl+enter on a PC) and you have gained substantial efficiency compared to leaving your keyboard, reaching for your mouse and finding the right button. In addition, this workflow should allow you to see larger chunks1 of your code as well as your output.

Explore Extensions

  • explore addins

  • explore the API

Favorite Shortcuts

  • use cmd+enter (ctrl+enter on PCs) to send selected code from the script window (on the left) to the console (on the right)

  • cmd+shift+option+R, while the cursor is within a function’s body (create a roxygen documentation skeleton)

  • use ctrl 1,2 to switch between console and script pane.

Pitfalls

  • You may stumble over RStudio’s defaults, such storing your global environment on exit and thus resurrecting long forgotten objects impacting your next experiment through, e.g., lexical scoping.
  • RStudio’s git integration abstracts git essentials away, so it hampers understanding what’s going on.
  • below vs beyond

For R, the Open Source Edition of R Studio Desktop is the right choice for most people. (If you are working in a team, R Studio’s server version is great. It allows having a centrally managed server which clients can use through their web browser without even installing R and R Studio locally.) R Studio has solid support for a few other languages often used together with R, plus it’s customizable. The French premier thinkR Colin_Fay gave a nice tutorial on Hacking R Studio at the useR! 2019 conference.

Back in fall 2020, long before RStudio turned into Posit, the company already indicated that data science was not about R vs. Python to them (Remember, the introduction of the programming chapter of this book?)

Then RStudio already indicated years ago, it was not solely about R

4.1.2 Visual Studio Code

Outside the R universe (and to some degree even inside it), Visual Studio Code became the go-to editor in data science. Microsoft’s Visual Studio Code started out as a modular, general purpose IDE not focused on a single language. Meanwhile, there is not only great Python support, but also auto-completion, code highlighting for R or Julia and many other languages. VS Code Live Share is just one rather random example of its remarkably well implemented features. Live share allows developers to edit a document simultaneously using multiple cursors similarly to Google Docs, but with all the IDE magic. And in a Desktop client.

4.1.3 Editors on Steroids

Another approach is to go for a highly customizable editor such as Sublime or Atom. The idea is to send source code from the editor to interchangeable REPLs which can be adapted according to the language that needs to be interpreted. That way a good linter / code highlighter for your favorite language is all you need to have a lightweight environment to run things. An example of such a customization approach is Christoph Sax’ small project Sublime Studio.

For readers with UNIX experience, vim may just be the most ubiquitous editor of them all, to everyone else it may just live in deep nerd territory. The fact a quick online search for ‘How to quit vim’ came back with more than 56.4 million (!) results shows that both perspectives have a point. With vim, users can switch between a move around and insert mode. The former allows the users to use single letters as shortcuts to navigate a text file instead of typing the actual letter. This enables users to do things like move-three-words-forward or delete the next three lines and many other more complex things. Given some practice and regular use, it is easy to imagine that vim wizards can navigate their source code incredibly quick.
Back when IDEs were less comfortable and often clunky due to their heavy lifting, a broader group of people had their incentives to invest into vim. Now, that IDEs became so much better and many of them even offering vim modes or plugins, the point of contact with vim for a data scientist is mostly UNIX server administration or work inside containers. GNU Emacs is another noteworthy editor because, even though much more exotic, it is popular among long tenured R folks thanks to the Emacs Speaks Statistics extension.

4.2 Notebooks

Notebooks are yet another popular choice for a home court among data scientists and analysts. Though most often associated with the Python world, notebooks are language-agnostic and also common for the R and Julia languages. The basic idea of notebooks is to run web server in the background to present a browser-based frontend to the developer. The difference between the Markdown rendering approach described in the publishing chapter is that the browser is not just used to display the result, but is also the interaction environment to interactively add commands. The resulting workflow for, e.g., a data analysis in the making feels like a social media timeline: execute a command, receive a result posted on a web page printed below the command, again posting a prompt to expect the next command. This way we get an endless scroll of commands, analysis and descriptive text in between.

Consider the following example, drawing a basketball court using the matplotlib Python package. This example uses screenshots of a notebook to illustrate the difference between a notebook and concepts like RMarkdown.

Note the prompts in between the results!

The above figure depicts a markdown element in editing mode, showing the Markdown syntax (double ## for a section header of type 2). The second element is a chunk of Python code to import two well-known Python libraries.

After the function definition, we see another markdown section header element that says Call the Function – this time already rendered to HTML. Finally, we call the Python function defined above in another code chunk element2.

4.3 The Console / Terminal

In addition to the editor you will spend most of your time with, it is also worthwhile to put some effort into configuring keyboard-driven interaction with your operating systems. And again, you do not need to be a terminal virtuoso to easily outperform mouse pushers, a terminal carpentry level is enough. Terminals come in all shades of gray, integrated into IDEs, built into our operating system or as pimped third-party applications. A program called a shell runs inside the terminal application to run commands and display output. bash is probably the most known shell program, but there are tons of different flavors. FWIW, I love fish shell (combined with iterm2) for its smooth auto-completion, highlighting and its extras.

In the Windows world, the use of terminal applications has been much less common than for OSX/Linux – at least for non-systemadmins. Git Bash which ships with git version control installations on Windows mitigates this shortcoming as it provides a basic UNIX style console on windows. For a full-fledged UNIX terminal experience, I suggest using a full terminal emulator like CYGWIN. More recent, native approaches like powershell brought the power of keyboard interaction at the OS level to a broader Windows audience – albeit with different Windows specific syntax. The ideas and examples in this book are limited to UNIX flavored shells.

Basic Unix Terminal commands
Command What it does
ls list files (and directories in a directory)
cd change directory
mv move file (also works to rename files)
cp copy files
mkdir make directory
rmdir remove directory
rm -rf (!) delete everything recursively. DANGER: shell does not ask for confirmation, just wipes out everything.

4.3.1 Remote Connections SSH, SCP

One of the most important use cases of the console for data analysts is the ability to log into other machines, namely servers that most often run on LINUX. Typically, we use the SSH protocol to connect to a (remote) machine that allows to connect through port 22.

Note that sometimes firewalls limit access to ports other than those needed for surfing the web (8080 for http:// and 443 for https://) so you can only access port 22 inside an organization’s VPN network.

To connect to a server using a username and password simply use your console’s ssh client like this:

ssh mbannert@someserver.org

You will often encounter another login procedure, though. RSA key pair authentication is more secure and therefore preferred by many organizations. You need to make sure the public part of your key pair is located on the remote server and hand the ssh command the private file:

ssh -i ~/.ssh/id_rsa mbannert@someserver.org

For more details on RSA key pair authentication, check this example from the case study chapter Section 11.1.

While ssh is designed to log in to a remote server and from then on, issue commands like the server was a local Linux machine, scp copies files from one machine to another.

scp -i ~/.ssh/id_rsa ~/Desktop/a_file_on_my_desktop
 mbannert@someserver.org:/some/remote/location/

The above command copies a file dwelling on the user’s desktop into a /some/remote/location on the server. Just like ssh, secure copy (scp) can use RSA key authentication, too.

4.3.2 Git through the Console

Another common use case of the terminal is managing git version control. Admittedly, there is git integration for many IDEs that allows you to point and click your way to commits, pushes and pulls as well as dedicated git clients like GitHub Desktop or Source Tree. But there is nothing like the console in terms of speed and understanding what you really do. The Chapter 5 sketches an idea of how to operate git through the console from the very basics to a feature branch-based workflow.


  1. Many coding conventions recommend having no more than 80 characters in one line of code. Sticking to this convention should prevent cutting off your code horizontally.↩︎

  2. Though notebooks were originally designed to run on a (local) Python based webserver and to be used in a web browser, there is a neat, Electron based standalone app called JupyterLab. I have used this app for the illustration in this book because of its slim, no nonsense interface that unlike browsers comes without distractions from plugins.↩︎