4 Interaction Environment
As a kid of the 90s, reading about a new editor or fancy shell coming out triggers playing Inspector Gadget’s tune in my head. To invest a few fun hours into exploring a new tool that might save me a split second or two in processes that I run a hundred times daily may not always pay off by the hour even in a lifetime of programming, but it is totally worth it in most cases – at least to me. Repetitive work is tiring and error-prone, adds to cognitive workload, and shifts focus away from hard-to-tackle puzzles.
That being said, I am nowhere near the vim aficionados whose jaw-dropping editing speed probably takes thousands of hours to master and configure, and I would not advise overengineering your environment setup. As so often, the key is to find a personal middle ground between feeling totally uncomfortable outside graphical user interfaces and editing code at 100 words per minute.
Most likely, investing in an initial setup of a solid editor plus some basic terminal workflows is a good starting point that can be revisited every once in a while, e.g., when we start a larger new project. Which editor you center your environment around will depend a lot on the programming language you choose. Yet, it also depends on your personal preference regarding how much investment, support and supportive features you want. The following features/criteria are likely the most influential when composing a programming environment for statistical computing:
- code highlighting, i.e., use of colors to highlight the structure of your code
- linting scans the code for potential errors and displays hints in the editor
- integrated debuggers allow running parts of the code line by line and inspecting the inner workings of functions
- multi-language support is important when working in multiple programming or markup languages
- terminal integration helps to run stuff using system commands
- git integration helps interact with git and do basic add, commit, push operations through the editor’s GUI
- build tools for programs that need rendering and compilation
- customizable through add-ins/macros
4.1 Integrated Development Environments
While some prefer lightning-fast editors such as Sublime Text that are easy to customize but rather basic out-of-the-box, Integrated Development Environments (IDEs) are the right choice for most people. In other words, it’s certainly possible to move a five-person family into a new home by public transport, but it is not convenient. The same holds for (plain) text editors in programming. You can use them, but most people would prefer an IDE just like they prefer to use a truck over public transport when they move. IDEs are tailored to the needs and idiosyncrasies of a language, some working with plugins and covering multiple languages. Others have a specific focus on a single language or a group of languages.
The sections below focus on data science's most popular editors, namely Visual Studio Code and RStudio. Still, I would like to mention at least some IDE juggernauts here for readers looking for alternatives: Eclipse (mostly Java, but with tons of plugins for other languages) and JetBrains' IntelliJ (Java) and PyCharm (Python).
4.1.1 RStudio
Posit's RStudio has become the default environment for most R people and for those who mostly use R but also C, Python or Markdown. The open source version ships for free as RStudio Desktop and RStudio Server Open Source. In addition, the creators of RStudio offer commercially supported versions of both the desktop and server version (Posit Workbench). If you want your environment to essentially look like the environment of your peers, RStudio is a good choice. Having the same visual image in mind can be very helpful in workshops, coaching or teaching.
RStudio has four control panes which the user can fill with different functionality such as script editor, R console, terminal, file explorer, environment explorer, test suite, build suite and many others. I advise newcomers to change the default to have the script editor and the console next to each other (instead of below each other). That is because (at least in the Western world) we read from left to right and send source code from left to right to execute it in the console. Combine this practice with the run selection shortcut (cmd+enter, or ctrl+enter on a PC) and you have gained substantial efficiency compared to leaving your keyboard, reaching for your mouse and finding the right button. In addition, this workflow should allow you to see larger chunks[^1] of your code as well as your output.
Explore Extensions
- explore add-ins
- explore the API
Favorite Shortcuts
- use cmd+enter (ctrl+enter on PCs) to send selected code from the script window (on the left) to the console (on the right)
- use cmd+shift+option+R while the cursor is within a function's body to create a Roxygen documentation skeleton
- use ctrl+1 and ctrl+2 to switch between the script pane and the console
Pitfalls
- you may stumble over RStudio’s defaults, such as storing your global environment on exit and thus resurrecting long forgotten objects impacting your next experiment through, e.g., lexical scoping.
- RStudio’s git integration abstracts git essentials away, so it hampers understanding of what’s going on.
For R, the Open Source Edition of RStudio Desktop is the right choice for most people. (If you are working in a team, RStudio's server version is great. It allows having a centrally managed server which clients can use through their web browser without even installing R and RStudio locally.) RStudio has solid support for a few other languages often used together with R, plus it's customizable. The French premier thinkR Colin Fay gave a nice tutorial on Hacking RStudio at the useR! 2019 conference.
Back in fall 2020, long before RStudio turned into Posit, the company already indicated that data science was not about R vs. Python to them. (Remember the first section of Chapter 3 of this book, The Choice That Doesn't Matter?)
4.1.2 Visual Studio Code
Outside the R universe (and to some degree even inside it), Visual Studio Code has become the go-to editor in data science. Microsoft's Visual Studio Code started out as a modular, general-purpose IDE not focused on a single language. By now, there is not only great Python support, but also auto-completion and code highlighting for R, Julia and many other languages. VS Code Live Share is just one rather random example of its remarkably well-implemented features. Live Share allows developers to edit a document simultaneously using multiple cursors, similar to Google Docs, but with all the IDE magic in a desktop client.
4.1.3 Editors on Steroids
Another approach is to go for a highly customizable editor such as Sublime[^2] or Atom[^3]. The idea is to send source code from the editor to interchangeable read-eval-print loops (REPLs), which can be swapped according to the language that needs to be interpreted. That way, a good linter/code highlighter for your favorite language is all you need to have a lightweight environment to run things. An example of such a customization approach is Christoph Sax's small project Sublime Studio[^4].
For readers with Unix experience, vim[^5] may just be the most ubiquitous editor of them all; to everyone else, it may just live in deep nerd territory. The fact that a quick online search for “How to quit vim” came back with more than 56.4 million (!) results shows that both perspectives have a point. With vim, users can switch between a move-around mode and an insert mode. In the former, single letters act as shortcuts to navigate a text file instead of typing the actual letter. This enables users to do things like move three words forward (3w) or delete three lines at once (3dd), and many other more complex things. Given some practice and regular use, it is easy to imagine that vim wizards can navigate their source code incredibly quickly.
Back when IDEs were less comfortable and often clunky due to their heavy lifting, a broader group of people had an incentive to invest in vim. Now that IDEs have become so much better, with many of them even offering vim modes or plugins, the point of contact with vim for a data scientist is mostly Unix server administration or work inside containers. GNU Emacs[^6] is another noteworthy editor because, even though much more exotic, it is popular among long-tenured R folks thanks to the Emacs Speaks Statistics extension.
4.2 Notebooks
Notebooks are yet another popular choice for a home court among data scientists and analysts. Though most often associated with the Python world, notebooks are language-agnostic and also common for the R and Julia languages. The basic idea of notebooks is to run a web server in the background to present a browser-based frontend to the developer. The difference from the Markdown rendering approach described in Chapter 10 about publishing is that the browser is not just used to display the result but is also the environment in which commands are added interactively. The resulting workflow, e.g., a data analysis in the making, feels like a social media timeline: execute a command, receive a result posted on a web page below the command, then get a new prompt awaiting the next command. This way we get an endless scroll of commands, analysis and descriptive text in between.
Consider the following example, drawing a basketball court using the matplotlib Python package. This example uses screenshots of a notebook to illustrate the difference between a notebook and concepts like RMarkdown. Note the prompts in between the results!
The above figure depicts a Markdown element in editing mode, showing the Markdown syntax (a double hash, ##, for a level-2 section header). The second element is a chunk of Python code to import two well-known Python libraries.
After the function definition, we see another Markdown section header element that says Call the Function, this time already rendered to HTML. Finally, we call the Python function defined above in another code chunk element[^7].
4.3 Console/Terminal
In addition to the editor with which you will spend most of your time, it is also worthwhile to put some effort into configuring keyboard-driven interaction with your operating system. And again, you do not need to be a terminal virtuoso to easily outperform mouse pushers; a terminal carpentry level is enough. Terminals come in all shades of gray: integrated into IDEs, built into our operating system or as pimped third-party applications. A program called a shell runs inside the terminal application to run commands and display output. bash is probably the best-known shell program, but there are tons of different flavors. FWIW, I love the fish[^8] shell (combined with iTerm2) for its smooth auto-completion, highlighting and its extras.
In the Windows world, the use of terminal applications has been much less common than on macOS/Linux, at least for non-sysadmins. Git Bash, which ships with git installations on Windows, mitigates this shortcoming, as it provides a basic Unix-style console for Windows. For a full-fledged Unix terminal experience, I suggest using a Unix-like environment such as Cygwin. More recently, native approaches like PowerShell have brought the power of keyboard interaction at the OS level to a broader Windows audience, albeit with a different, Windows-specific syntax. The ideas and examples in this book, such as those in the table below, are limited to Unix-flavored shells.
Command | What it does |
---|---|
ls | list files and directories in a directory |
cd | change directory |
mv | move file (also works to rename files) |
cp | copy files |
mkdir | make directory |
rmdir | remove directory |
rm -rf | (!) delete everything recursively. DANGER: shell does not ask for confirmation and just wipes out everything. |
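
The commands in the table can be tried out safely in a throwaway directory. The following is a minimal sketch, assuming a Unix-flavored shell with `mktemp` available; the directory and file names (`raw`, `archive`, `data.csv`) are made up for illustration.

```shell
# Work in a scratch directory so nothing important can be deleted.
demo="$(mktemp -d)"
cd "$demo"

mkdir raw archive                        # make two directories
echo "id,value" > raw/data.csv           # create a small example file
cp raw/data.csv raw/backup.csv           # copy a file
mv raw/backup.csv archive/data_v1.csv    # move and rename in one step
ls archive                               # prints: data_v1.csv

rm raw/data.csv                          # remove a single file
rmdir raw                                # rmdir only works on empty directories
rm -rf archive                           # deletes the directory and its content without asking
```

Note that `rm -rf` is pointed at a hypothetical scratch folder here. Double-checking the path before hitting enter is a habit worth building early.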
4.3.1 Remote Connections SSH, SCP
One of the most important use cases of the console for data analysts is the ability to log into other machines, namely servers that most often run on Linux. Typically, we use the SSH protocol to connect to a (remote) machine that accepts connections on port 22.
Note that firewalls sometimes limit access to ports other than those needed for surfing the web (80 for http:// and 443 for https://), so you may only be able to access port 22 from inside an organization's VPN.
To connect to a server using a username and password, simply use your console’s SSH client like this:
ssh mbannert@someserver.org
You will often encounter another login procedure, though. SSH key pair authentication is more secure and therefore preferred by many organizations. You need to make sure the public part of your key pair is located on the remote server and hand the ssh command the private key file:
ssh -i ~/.ssh/id_rsa mbannert@someserver.org
For more details on SSH key pair authentication, check the example from the case studies in Chapter 11, Section 11.1.
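
How such a key pair comes into existence can be sketched as follows. This is a minimal illustration, assuming the OpenSSH tools are installed. The scratch directory and the file name `id_demo` are hypothetical; real keys usually live in `~/.ssh/` and should be protected with a passphrase.

```shell
# Hypothetical scratch location for the demo; real keys usually live in ~/.ssh/
keydir="$(mktemp -d)"

# Create an Ed25519 key pair; -N "" sets an empty passphrase (demo only).
ssh-keygen -q -t ed25519 -N "" -C "demo key" -f "$keydir/id_demo"

# Two files result: id_demo (private, never share) and id_demo.pub (public).
ls "$keydir"

# The public half is what needs to end up in ~/.ssh/authorized_keys on the
# server, e.g., via:
#   ssh-copy-id -i "$keydir/id_demo.pub" mbannert@someserver.org
```

The private file stays on your machine and is the one handed to `ssh -i`, while the `.pub` file can be shared freely.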
While SSH is designed to log in to a remote server and, from then on, issue commands as if the server were a local Linux machine, scp copies files from one machine to another.
scp -i ~/.ssh/id_rsa ~/Desktop/a_file_on_my_desktop \
  mbannert@someserver.org:/some/remote/location/
The above command copies a file from the user's desktop into /some/remote/location/ on the server. Just like ssh, secure copy (scp) can use SSH key pair authentication.
4.3.2 Git Through the Console
Another common use case of the terminal is managing git. Admittedly, there is git integration for many IDEs that allows you to point and click your way to commits, pushes and pulls, and there are dedicated git clients like GitHub Desktop or Sourcetree. But there is nothing like the console in terms of speed and understanding what you really do. Chapter 5 sketches an idea of how to operate git through the console from the very basics to a feature branch-based workflow.
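
To give a first taste of what this looks like, here is a minimal sketch of the basic add and commit cycle in a scratch repository. It assumes git is installed; the user name, e-mail address and file name are made up for illustration.

```shell
# Create an empty repository in a scratch directory.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Local identity for this repository only (hypothetical values).
git config user.name "Demo User"
git config user.email "demo@example.org"

echo "# analysis" > README.md
git status --short            # shows README.md as untracked (??)
git add README.md             # stage the file
git commit -q -m "Add README" # record the staged snapshot
git log --oneline             # one commit so far

# git push would publish the commit once a remote has been configured.
```

Typing these commands a few times makes it much clearer what the buttons in an editor's GUI actually do.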
[^1]: Many coding conventions recommend having no more than 80 characters in one line of code. Sticking to this convention should prevent cutting off your code horizontally.
[^2]: https://www.sublimetext.com/
[^3]: https://atom.io/
[^4]: https://github.com/christophsax/SublimeStudio
[^5]: https://www.vim.org/
[^6]: https://www.gnu.org/software/emacs/
[^7]: Though notebooks were originally designed to run on a (local) Python-based web server and to be used in a web browser, there is a neat, Electron-based standalone app called JupyterLab. I have used this app for the illustrations in this book because of its slim, no-nonsense interface that, unlike browsers, comes without distractions from plugins.
[^8]: https://fishshell.com/