6  Data Management

Admittedly, I have said the same thing about Chapter 5 on version control, yet data management may be the single most impactful chapter of this book. This is particularly true if you come from an environment that mostly organizes data in spreadsheets shared via e-mail or network drives. To contextualize data and think about (long-term) data management is a step into a bigger world.

After decades of information engineering and computer science, some can’t help but wonder why we have not found one perfect, one-size-fits-all form of data. In fact, a substantial part of programming with data deals with transforming data from one form into another. This chapter intends to give an idea of the aspects of data management most relevant for data analytics. Hopefully, it helps the reader to assess how far they want to dig into data management.

6.1 Forms of Data

In research and analytics, data appear in a plethora of different forms. Yet most researchers and business analysts are trained mainly to handle different flavors of two-dimensional data, as in one-observation-one-row. Ad hoc studies conducted once result in cross-sectional data: one line per observation, with columns representing variables. Sensor data, server logs or forecasts of official statistics are examples of single-variable data observed over time. These single-variable, longitudinal data are also known as time series. Multivariate time series, i.e., multi-variable, longitudinal data, are often referred to as panel data. In our heads, all of these forms of data are typically represented as rectangular, two-dimensional, one-line-per-observation, spreadsheet-like tables. Here are a few easy-to-reproduce examples using popular R demo datasets.

h <- head(mtcars)
h
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
dim(h)
[1]  6 11

The above output shows an excerpt of the mtcars cross-sectional dataset with 6 lines and 11 variables. AirPassengers is a time series dataset represented in an R ts object, which is essentially a vector with a time-based index attribute.

AirPassengers
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432

Let’s create a multivariate time series (panel) dataset, i.e., multiple variables observed over time:

d <- data.frame(Var1 = rnorm(10, 0),
           Var2 = rnorm(10, 10),
           Var3 = rnorm(10, 30))
multi_ts <- ts(d, start = c(2000,1), frequency = 4)
multi_ts
              Var1      Var2     Var3
2000 Q1  0.6428638  8.780140 30.35948
2000 Q2 -1.0156843  9.813669 28.85264
2000 Q3 -0.3286740  8.690206 29.40070
2000 Q4  0.2368141 10.873496 30.34395
2001 Q1 -2.0052139 10.996617 31.00408
2001 Q2  0.8869538  8.783797 29.51489
2001 Q3  0.5119350 10.631907 30.07512
2001 Q4  0.8801927  9.602890 30.25296
2002 Q1  0.4160729 10.213400 29.87545
2002 Q2 -0.6564528  9.872200 29.48070

A Note on Long Format vs. Wide Format

The above multi-variable time series is shown in what the data science community calls wide format. In this most intuitive format, every column represents one variable and every row represents one point in time. The counterpart is the so-called long format shown below. The long format is a machine-friendly, flexible way to represent multi-variable data without altering the number of columns as variables are added.

library(tsbox)
ts_dt(multi_ts)[1:15,]
      id       time      value
 1: Var1 2000-01-01  0.6428638
 2: Var1 2000-04-01 -1.0156843
 3: Var1 2000-07-01 -0.3286740
 4: Var1 2000-10-01  0.2368141
 5: Var1 2001-01-01 -2.0052139
 6: Var1 2001-04-01  0.8869538
 7: Var1 2001-07-01  0.5119350
 8: Var1 2001-10-01  0.8801927
 9: Var1 2002-01-01  0.4160729
10: Var1 2002-04-01 -0.6564528
11: Var2 2000-01-01  8.7801401
12: Var2 2000-04-01  9.8136692
13: Var2 2000-07-01  8.6902060
14: Var2 2000-10-01 10.8734958
15: Var2 2001-01-01 10.9966173

The ability to transform data from one format into the other and to manipulate both formats is an essential skill for any data scientist or data engineer, and an absolute go-to skill for anyone who wants to use programming to run analyses: different analyses or visualizations may require one form or the other and ask for quick transformation.

Hence, popular data science programming languages offer great tool sets to get the job done. Mastering these toolboxes is not the focus of this book. R for Data Science and the Carpentries are good starting points if you feel the need to catch up or solidify your know-how.
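As a quick illustration of what such a toolbox looks like, here is a minimal sketch using the tidyr package (one option among several; data.table or tsbox work just as well). The small example data.frame is made up for this illustration.

library(tidyr)

# a small wide-format panel: one column per variable, one row per period
wide <- data.frame(
  period = c("2000 Q1", "2000 Q2", "2000 Q3"),
  Var1 = c(0.64, -1.02, -0.33),
  Var2 = c(8.78, 9.81, 8.69)
)

# wide -> long: one row per variable/period combination
long <- pivot_longer(wide, cols = c(Var1, Var2),
                     names_to = "id", values_to = "value")

# long -> wide: back to one column per variable
pivot_wider(long, names_from = "id", values_from = "value")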

Yet, not all information suits a two-dimensional form. Handling nested or unstructured information is one of the fields where the strength of a programming approach to data analysis and visualization comes into play. Maps are a common form of information that is often represented in nested fashion. For an example of nested data, let’s take a look at the map file from the code example case study in Section 11.6. In memory, i.e., in our R session, the data are represented as a list that contains multiple lists, which may in turn contain more lists nested inside.

library(jsonlite)

json_ch <- jsonlite::read_json(
  "https://raw.githubusercontent.com/mbannert/maps/master/ch_bfs_regions.geojson"
)
ls.str(json_ch)
crs : List of 2
 $ type      : chr "name"
 $ properties:List of 1
features : List of 7
 $ :List of 3
 $ :List of 3
 $ :List of 3
 $ :List of 3
 $ :List of 3
 $ :List of 3
 $ :List of 3
type :  chr "FeatureCollection"

Another example of nested but structured data are HTML or XML trees obtained from scraping websites. Typically, web scraping approaches like rvest (Wickham 2022) or Beautiful Soup (Zheng, He, and Peng 2015) parse a website’s hierarchical Document Object Model (DOM) and turn it into an in-memory representation. For a DOM parsing example, see CASE STUDY XYZ.
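To give a first impression of DOM parsing before the case study, here is a minimal, hypothetical sketch using rvest; the URL and the CSS selector are placeholders that would need to be adapted to the site at hand.

library(rvest)

# parse the website's DOM into an in-memory XML document
# (the URL is just a placeholder)
dom <- read_html("https://example.com")

# select nodes via a CSS selector and extract their text content
dom |>
  html_elements("h2") |>
  html_text2()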

6.2 Representing Data in Files

To create the above examples of different forms of data, it was mostly sufficient to represent data in memory, in this case within an R session. Because R is an interpreted language, an R interpreter has to run at all times when using R. The very same is true for Python. Users of these languages can work interactively, very much like with a pocket calculator on heavy steroids. All functions and all data are loaded into a machine’s RAM (memory), represented as objects of various classes. This is convenient, but has an obvious limitation: once the session ends, the information is gone. Hence, we need a way to store at least the results of our computation in persistent fashion.

Just like in office or image editing software, the intuitive way to store data persistently from a programming language is to store them in files. The choice of the file format is much less straightforward in our case, though. The different forms of data discussed above, potential collaborators and interfaces are among the factors that weigh into our choice of a file format.

6.2.1 Spreadsheets

Given our intuition and training focused on two dimensions, spreadsheets are the on-disk analog of data.frames, data.tables and tibbles. Formats like .csv or .xlsx are the most common way to represent two-dimensional data on disk.
On the programming side, the ubiquity of spreadsheets leads to a wide variety of libraries to parse and write different spreadsheet formats.

import pandas as pd

# create a small data frame and write it to a semicolon-separated .csv file
d = {'column1': [1, 2], 'column2': [3, 4]}
df = pd.DataFrame(data=d)
df.to_csv("an_example.csv", sep=";", encoding='utf-8')

Comma-separated values (.csv)1 are a good and simple option. Their text-based nature makes .csv files language agnostic and human-readable through any text editor.

;column1;column2
0;1;3
1;2;4

Though Excel spreadsheets are a convenient interface to office environments and offer extras such as organization into workbooks, the simpler .csv format has advantages in machine-to-machine communication and as an interface between different programming languages and tools. For example, web visualization libraries such as highcharts or echarts are most commonly written in JavaScript and can conveniently consume data from .csv files. The above example .csv file was written in Python and is now easily read by R.

library(readr)
csv <- readr::read_csv2("an_example.csv")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
New names:
• `` -> `...1`
Rows: 2 Columns: 3
── Column specification ────────────────────────────────────────
Delimiter: ";"
dbl (3): ...1, column1, column2
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
csv
# A tibble: 2 × 3
   ...1 column1 column2
  <dbl>   <dbl>   <dbl>
1     0       1       3
2     1       2       4

6.2.2 File Formats for Nested Information

For many data engineers and developers, JavaScript Object Notation (JSON) has become the go-to file format for nested data. Just like with .csv, basically every programming language used in data science and analytics has libraries to serialize and deserialize JSON (read and write). Though harder to read for humans than .csv, prettified JSON with a decent highlighting color scheme is easy enough to read and gives the human reader a good understanding of the hierarchy at hand. The added complexity comes mostly from the nested nature of the data, not so much from the file format itself.

library(jsonlite)

li <- list(
  element_1 = head(mtcars, 2),
  element_2 = head(iris, 2)
)

toJSON(li, pretty = TRUE)
{
  "element_1": [
    {
      "mpg": 21,
      "cyl": 6,
      "disp": 160,
      "hp": 110,
      "drat": 3.9,
      "wt": 2.62,
      "qsec": 16.46,
      "vs": 0,
      "am": 1,
      "gear": 4,
      "carb": 4,
      "_row": "Mazda RX4"
    },
    {
      "mpg": 21,
      "cyl": 6,
      "disp": 160,
      "hp": 110,
      "drat": 3.9,
      "wt": 2.875,
      "qsec": 17.02,
      "vs": 0,
      "am": 1,
      "gear": 4,
      "carb": 4,
      "_row": "Mazda RX4 Wag"
    }
  ],
  "element_2": [
    {
      "Sepal.Length": 5.1,
      "Sepal.Width": 3.5,
      "Petal.Length": 1.4,
      "Petal.Width": 0.2,
      "Species": "setosa"
    },
    {
      "Sepal.Length": 4.9,
      "Sepal.Width": 3,
      "Petal.Length": 1.4,
      "Petal.Width": 0.2,
      "Species": "setosa"
    }
  ]
} 

The above example shows the first two rows of two different, unrelated rectangular datasets. Thanks to the hierarchical nature of JSON, both datasets can be stored in the same file albeit having totally different columns. Again, just like .csv, JSON works well as an interface, but it is more flexible than the former.
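To complete the round trip, the very same list can be serialized to a file on disk and deserialized back into R. A minimal sketch (the file name is arbitrary):

# write the list of two unrelated datasets to a JSON file ...
write_json(li, "bundle.json", pretty = TRUE)

# ... and read it back into a named list of data.frames
li_2 <- read_json("bundle.json", simplifyVector = TRUE)
str(li_2, max.level = 1)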

Besides JSON, XML is the most common format to represent nested data in files. Though there are a lot of overlapping use cases, there is a bit of a different groove around both of these file formats. JSON is perceived as more lightweight and closer to ‘the web’, while XML is the traditional, very explicit, no-nonsense format. An XML document can be accompanied by a Document Type Definition (DTD) that defines its structure and which elements and attributes are legal. Higher level formats use this more formal approach for an XML-based definition. SDMX2, a worldwide effort to provide a format for the exchange of statistical data and metadata, is an example of such a higher level format built on XML.

<CompactData xmlns:c="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/compact" xmlns:sdds="urn:sdmx:org.sdmx.infomodel.datastructure.DataStructure=IMF:ECOFIN_DSD" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:msg="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message">
<Header>
<ID>kofbarometer</ID>
<Test>false</Test>
<Prepared>2022-07-27 23:19:15</Prepared>
<Sender id="KOF">
<Name xml:lang="en">KOF Swiss Economic Institute</Name>
<Contact>
<Name>KOF Swiss Economic Institute</Name>
<URI>http://www.kof.ethz.ch/prognosen-indikatoren/indikatoren/kof-konjunkturbarometer.html</URI>
</Contact>
</Sender>
</Header>
<sdds:DataSet>
<sdds:Series DATA_DOMAIN="FLI" REF_AREA="CH" INDICATOR="AL" COUNTERPART_AREA="_Z" FREQ="M" UNIT_MULT="0" TIME_FORMAT="P3M">
<sdds:Obs TIME_PERIOD="1991-01" OBS_VALUE="79.7534465342489" OBS_STATUS="A"/>
<sdds:Obs TIME_PERIOD="1991-02" OBS_VALUE="71.8659035408967" OBS_STATUS="A"/>



<sdds:Obs TIME_PERIOD="2022-05" OBS_VALUE="96.4419659191275" OBS_STATUS="A"/>
<sdds:Obs TIME_PERIOD="2022-06" OBS_VALUE="95.1748194208808" OBS_STATUS="A"/>
<sdds:Obs TIME_PERIOD="2022-07" OBS_VALUE="90.08821804515" OBS_STATUS="A"/>
</sdds:Series>
</sdds:DataSet>
</CompactData>
<!--  i: 1403 : 1660204311  -->

The above example shows an excerpt of the main economic forward-looking indicator (FLI) for Switzerland, the KOF Economic Barometer, represented in an SDMX XML file. Besides the value and the date index, several attributes provide the consumer with an elaborate data description. In addition, other nodes and their children provide information like Contact or ID in the very same file. Note that modern browsers often provide code folding for nodes and highlighting to improve readability.
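To give an idea of how to consume such a file programmatically, here is a minimal sketch using the xml2 R package. The local file name is an assumption, and the local-name() XPath expression is just a convenient way to sidestep the SDMX namespaces.

library(xml2)

# parse the SDMX XML shown above (assuming it was saved as kofbarometer.xml)
doc <- read_xml("kofbarometer.xml")

# select all observation nodes, ignoring namespaces
obs <- xml_find_all(doc, "//*[local-name() = 'Obs']")

# extract the attributes into a two-column data.frame
data.frame(
  time  = xml_attr(obs, "TIME_PERIOD"),
  value = as.numeric(xml_attr(obs, "OBS_VALUE"))
)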

6.2.3 A Word on Binaries

Unlike all file formats discussed above, binaries cannot be read by humans using a simple text editor. In other words, you will need the software that wrote the binary to read it again. If that software was expensive and/or exotic, your work is much less accessible, more difficult to share and harder to reproduce. Though this disadvantage of binaries is mitigated when you use freely available open source software, storing data in binaries can still be a hurdle.

But, of course, binaries do have advantages, too: binaries can compress their content and save space. Binaries can take on all sorts of in-memory objects including functions, not just datasets. In other words, binaries can bundle stuff. Consider the following load/save operation in R:

bogus <- function(a,b){
  a + b
} 

data(AirPassengers)
data(mtcars)

s <- summary(mtcars)

save("bogus", "Airpassengers","s", file="bundle.RData")

In memory, bogus is a function, AirPassengers is an R time series object and s is a summary object. All of these objects can be stored in a single binary RData file using save(). A fresh R session can now load() everything stored in that file.

load("bundle.RData")

Notice that, unlike reading a .csv or .json file, the load() call does not assign to a target object. This is because all objects are loaded into an R environment (.GlobalEnv by default) under their original names.
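If this implicit behavior feels too magical, base R also offers saveRDS() and readRDS(), which store a single object and leave the naming to an explicit assignment when reading; the file name below is arbitrary.

# store a single object ...
saveRDS(s, "mtcars_summary.rds")

# ... and assign it to whatever name you like when reading it back
mtcars_summary <- readRDS("mtcars_summary.rds")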

6.2.4 Interoperable File Formats

Interoperable file formats cover some middle ground between the options described above. The cross-language in-memory development platform Apache Arrow is a well established project that also implements file formats that work across many popular (data science) environments. Though the major contribution of the Apache Arrow project is to allow sharing in-memory data stores across environments, I will just show it as an example of interoperable file formats here. Nevertheless, if you’re interested in a modern, yet established cross-environment data science project, digging deeper into Apache Arrow is certainly a fun experience.

From the Apache Arrow documentation:

library(dplyr)
library(arrow)
data("starwars")
file_path_sw <- "starwars.parquet"
write_parquet(starwars, file_path_sw)

The above R code writes the starwars demo dataset from the dplyr R package to a .parquet file. The arrow R package comes with the necessary tool set to write the open source columnar3 .parquet format. Though they are not text files, .parquet files can be read and written from many different environments. The below code uses the arrow library for Python to read the file we have just written with R.

import pyarrow.parquet as pa
pa.read_table("starwars.parquet")
pyarrow.Table
name: string
height: int32
mass: double
hair_color: string
skin_color: string
eye_color: string
birth_year: double
sex: string
gender: string
homeworld: string
species: string
films: list<item: string>
  child 0, item: string
vehicles: list<item: string>
  child 0, item: string
starships: list<item: string>
  child 0, item: string
----
name: [["Luke Skywalker","C-3PO","R2-D2","Darth Vader","Leia Organa",...,"Rey","Poe Dameron","BB8","Captain Phasma","Padmé Amidala"]]
height: [[172,167,96,202,150,...,null,null,null,null,165]]
mass: [[77,75,32,136,49,...,null,null,null,null,45]]
hair_color: [["blond",null,null,"none","brown",...,"brown","brown","none","unknown","brown"]]
skin_color: [["fair","gold","white, blue","white","light",...,"light","light","none","unknown","light"]]
eye_color: [["blue","yellow","red","yellow","brown",...,"hazel","brown","black","unknown","brown"]]
birth_year: [[19,112,33,41.9,19,...,null,null,null,null,46]]
sex: [["male","none","none","male","female",...,"female","male","none",null,"female"]]
gender: [["masculine","masculine","masculine","masculine","feminine",...,"feminine","masculine","masculine",null,"feminine"]]
homeworld: [["Tatooine","Tatooine","Naboo","Tatooine","Alderaan",...,null,null,null,null,"Naboo"]]
...

Here’s Julia reading our Parquet file:

using Parquet
sw = Parquet.File("starwars.parquet")
Parquet file: starwars.parquet
    version: 2
    nrows: 87
    created by: parquet-cpp-arrow version 9.0.0
    cached: 0 column chunks

When I composed this example of reading and writing parquet files in different environments, I ran into several compatibility issues. This shows that the interoperability of such binary formats is not yet on par with the interoperability of text files.

6.2.4.1 A Note on Overhead

The parquet format is designed to read and write files swiftly and to consume less disk space than text files. Both features can become particularly relevant in the cloud. Note though that parquet comes with some overhead, which may eat up gains if datasets are small. Consider our starwars dataset. At 87 rows and 14 columns, the dataset is rather small.

library(readr)
write_csv(starwars, file = "starwars.csv")
dim(starwars)
[1] 87 14
round(file.size("starwars.parquet") / file.size("starwars.csv"), digits = 2)
[1] 1.47

Hence, the overhead of a schema implementation and other meta information outweighs parquet’s compression for such a small dataset, leading to a parquet file that is almost 1.5 times larger than the corresponding csv file. Yet, parquet already turns the tables for the diamonds demo dataset from the ggplot2 R package, which is by no means a large dataset.

library(ggplot2)
data(diamonds)
write_csv(diamonds, file = "diamonds.csv")
write_parquet(diamonds, "diamonds.parquet" )
round(file.size("diamonds.parquet") / file.size("diamonds.csv"), digits = 2)
[1] 0.21

The parquet file for the diamonds dataset has roughly one fifth of the size of the corresponding text file. This is a great example of why decades of information engineering have not produced one single, perfect, one-size-fits-all form of data. So when you choose how to represent data in your project, think about your goals, your most common use or query, and a smooth data transformation strategy for when the use cases or goals change.

6.3 Databases

Given the options that file based approaches provide, what is a) the difference and b) the added value of going for a database to manage data? The front-and-center difference is the client interface, but there are many more differences and benefits.

The above figure shows a terminal client which users use to send queries written in a query language. The client sends these queries to a database host, which either performs an operation on the database quietly or returns a result. The most common example of such a query language is the Structured Query Language (SQL). This leads to a standard way of interacting with the data, no matter what the dataset looks like in terms of dimensions, size, etc. SQL databases have been around much longer than data science itself and continue to be indispensable as application backends and data archives for many use cases.

SELECT * FROM myschema.demotable

The above query would return all rows and all columns from a table called demotable in a schema called myschema. Such a query can be sent from a standalone database client, a database-specific IDE with a built-in client such as DataGrip, or a programming language. Given the ubiquity of databases, basically any programming language has native interfaces to the most common databases. And if that is not the case, there is the database-management-system-agnostic ODBC standard that is supported by all major SQL databases. The below code shows how to connect from R to PostgreSQL, send queries from within R and receive results as R objects.

library(RPostgres)
con <- dbConnect(
  Postgres(),
  host = "localhost",
  user = "bugsbunny",
  password = rstudioapi::askForPassword("Enter Pw"), # only works within RStudio
  dbname = "some_db_name"
)

# the result is an R data.frame
res <- dbGetQuery(con, "SELECT * FROM myschema.demotable")

# and we can do R things with it,
# such as show the first 6 lines.
head(res)
dbDisconnect(con)

Obviously, the above example barely shows the tip of the iceberg, as it is just meant to illustrate the way we interact with databases as opposed to a file system. To dig a little deeper into databases, I recommend getting a solid understanding of the basic CREATE, SELECT, INSERT, UPDATE, DELETE, TRUNCATE and DROP statements, as well as basic JOINs and WHERE clauses (see the sketch below for a minimal example). Also, it is helpful to understand the concept of normalization up to the third normal form.
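To get a first feel for JOINs and WHERE clauses without setting up a database server, here is a minimal sketch that runs SQL against an in-memory SQLite database (introduced in the next subsection) from within R; the two toy tables are made up for this illustration.

library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# two toy tables: observations plus a lookup table with station names
dbWriteTable(con, "obs", data.frame(station_id = c(1, 1, 2),
                                    value = c(4.2, 5.1, 7.3)))
dbWriteTable(con, "stations", data.frame(station_id = c(1, 2),
                                         name = c("Zurich", "Basel")))

# join both tables and filter the result with a WHERE clause
dbGetQuery(con, "
  SELECT s.name, o.value
  FROM obs AS o
  JOIN stations AS s ON o.station_id = s.station_id
  WHERE o.value > 5
")

dbDisconnect(con)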

6.3.1 Relational Database Management Systems (RDBMS)

When you need to pick a concrete database technology for your project, the first major choice is whether to go for a relational system or not. Unless you have a very compelling reason not to, you are almost always better off with a relational database: Relational databases are well established and accessible from any programming language used in programming with data that one could think of. In addition, modern RDBMS implementations offer many non-relational features such as JSON field types and operations.

I would classify the most popular relational database implementations as follows. First, there is SQLite. As the name suggests, SQLite is a lightweight, stripped-down, easy-to-use and easy-to-install implementation.

SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. SQLite is the most used database engine in the world. SQLite is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day. – SQLite.org

SQLite data lives in a single file that the user queries through the SQLite engine. Here is an example using that engine from R.

library(RSQLite)
db_path <- "rse.sqlite3"
con <- dbConnect(RSQLite::SQLite(), db_path)
dbWriteTable(con, dbQuoteIdentifier(con, "mtcars"), mtcars, overwrite = TRUE)
dbWriteTable(con, dbQuoteIdentifier(con, "flowers"), iris, overwrite = TRUE)

The above code initiates a SQLite database and continues to write the built-in R demo datasets into separate tables in that newly created database. Now we can use SQL to query the data. Return the first three rows of flowers:

dbGetQuery(con, "SELECT * FROM flowers LIMIT 3")
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

Return cars that are more fuel efficient than 30 miles per gallon:

dbGetQuery(con, "SELECT * FROM mtcars WHERE mpg > 30")
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
2 30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
3 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
4 30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

MySQL can do a little more and is also immensely popular, particularly as a database backend for web content management systems and other web-based applications. The so-called LAMP stack (Linux, Apache, MySQL and PHP) contributed to its rise decades ago when it fueled many small and medium-sized web projects around the world. In its early days, MySQL was an independent open source project, but it was later acquired by database juggernaut Oracle as a lite version to go with its flagship product.

While certainly doing its job in millions of installations, MySQL is not at the same level as Microsoft SQL Server (MSSQL), PostgreSQL and Oracle Database, and I suggest one of these three enterprise-level databases as the data store for research projects that go beyond hosting a blog. Especially when it comes to long-term conservation of data and enforcing consistency, MSSQL, PostgreSQL and Oracle are hard to beat. Among the three, I would personally always lean towards the license-cost-free, open source PostgreSQL, but fitting into existing ecosystems is a good reason to go with MSSQL or Oracle if you can afford the licenses. For many use cases, there is hardly any difference for the analyst or scientific end user. PostgreSQL may have the coolest spatial support, MSSQL’s T-SQL dialect may offer some extra convenient queries if your developers have mastered the dialect, and Oracle may have the edge in performance and Java interaction here and there, but none of these systems is a bad choice.

Another database management system that has received a lot of attention recently (and rightfully so) is DuckDB. Because it is mentioned so often and so overwhelmingly positively, it is important to understand what it is and when to use it. DuckDB is not yet another competitor trying to gain ground from the big three of MSSQL, PostgreSQL and Oracle. DuckDB does offer an SQL interface, but it is very different in its aims from the traditional SQL databases. DuckDB is serverless and allows accessing parquet files via a very fast SQL interface. This makes DuckDB a great tool for interactive analysis and the transfer of large result sets, but not so suitable for enterprise data warehousing.
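A minimal sketch of this use case, assuming the starwars.parquet file from Section 6.2.4 still sits in the working directory:

library(DBI)
library(duckdb)

# DuckDB runs embedded in the R session, there is no server to set up
con <- dbConnect(duckdb())

# query the parquet file directly with SQL
dbGetQuery(con, "
  SELECT name, height, mass
  FROM 'starwars.parquet'
  WHERE height > 200
")

dbDisconnect(con, shutdown = TRUE)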

6.3.2 A Word on Non-Relational Databases

Among other things, relational databases are ACID (Atomicity, Consistency, Isolation, and Durability) compliant and ask for very little in return to provide us with a framework that keeps our data quality high for decades. So unless you have a very specific use case that translates into a compelling reason to use a non-relational database, stick to SQL. Document-oriented storage or very unstructured information could be such a reason, yet the JSON support of modern relational databases allows handling JSON in database cells as well. About a decade ago, MongoDB gained traction, partly piggybacking on the success of JavaScript, and server-side JavaScript in particular. In web development, the MEAN stack (MongoDB, Express.js, Angular and Node.js) became popular, and with the bundle, the idea of non-relational databases as a fast track to a backend spread.

Columnar stores, which are also considered non-relational, are conceptually very similar to relational databases, though denormalized and designed to structure sparse data. Database systems like Apache Cassandra are designed to scale horizontally and be highly available, managing massive amounts of data. Cloud applications that distribute data across multiple nodes for high availability benefit from such an approach. Other options include Redis or Couchbase. If you are not happy with the ‘beyond-the-scope-of-this-book’ argument, blogging experts like Lukas Eder may be biased, but they are much better educated (and more fun) to learn from here.

6.4 Non-Technical Aspects of Managing Data

The fact that we can do more data work single-handedly than ever before does not only equate to more options. It also means we need to be aware of new issues and responsibilities. Those responsibilities range from leading by example when it comes to etiquette and ethical aspects, to sticking to privacy rules and complying with security standards. In addition to the normative restrictions that come with handling data, the options and choices of data dissemination are a realm of their own. Just like with software publications, you should not just ‘drop’ data without a license and instructions on acceptable use of the data.

6.4.1 Etiquette

Just because content is publicly available on a website, it does not automatically mean that bulk downloads, aggregation and re-publishing are OK. For example, being able to scrape a website daily, even with good scientific intent, does not mean the website’s AUP (Acceptable Use Policy) allows you to systematically archive its content.

Be responsible when scraping data from websites by following polite principles: introduce yourself, ask for permission, take slowly and never ask twice. –CRAN description of the polite R package.

In other words, the new type of researcher discussed in this book needs to be aware of potential legal and social consequences. The {polite} R package quoted above is an example of an alternative approach that favors etiquette over hiding IP addresses to avoid access denial.

6.4.2 Security

“I ain’t got nothing to hide. I don’t care about people reading my stuff.” I have heard similar ‘arguments’ a zillion times. People who argue like that not only endanger their environment, they also contribute to a less secure internet at large, as they leave their devices open to contributing to malicious activity.

It does not have to be you who has been trusted to work with data that qualify as sensitive. Someone on your team or within your network might have been, and that person may trust you more than a stranger and be less vigilant when harmful actions are executed in your name. This is why you are obliged to care about security. As in: do NOT store your credentials in your scripts. As in: passwords are NOT part of your analysis. You may accidentally push your password to GitHub, where it is not only publicly available but also hard for beginners to delete. Make sure to choose secure passwords. Use a password manager, so the passwords do not need to be your cat’s name for you to remember them. Use a password manager so you can afford not to have the same password for multiple applications. Use a password manager so you can afford to change your passwords. Use key files where you can. The case study chapter gives you a hands-on recipe for using RSA key pairs to connect to remote servers, e.g., to your GitHub account instead of a username / password combination. This is also a good way to connect to a remote server via SSH.
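One simple way to keep credentials out of your scripts is to read them from environment variables, e.g., set in a local .Renviron file that never gets committed to version control. A minimal sketch, reusing the PostgreSQL connection from above (the variable names are arbitrary):

# .Renviron (kept out of version control) contains, for example:
# DB_USER=bugsbunny
# DB_PASSWORD=some-long-generated-secret

library(RPostgres)
con <- dbConnect(
  Postgres(),
  host = "localhost",
  user = Sys.getenv("DB_USER"),
  password = Sys.getenv("DB_PASSWORD"),
  dbname = "some_db_name"
)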

In addition to the above brainy smurf advice, let me mention security as a reason to consider using a database to manage and archive your data for the long haul. Enterprise level databases allow for granular access management and help to stay ahead of users and their rights regarding your database’s entries.

6.4.3 Privacy

Privacy in data science is a complex issue that could legitimately fill a book on its own. Though I cannot cover privacy comprehensively in a single section, it is important to me to raise awareness and hopefully create an entry point to the matter. In its essence, when working with data, respecting privacy is about avoiding exposure of individual units without their explicit prior consent. It is important to understand that exposure does not stop at names. A single extraordinary feature or an exotic combination of features can identify an individual within a dataset, or at least expose a group. This is why merging multiple datasets may also cause privacy concerns when the datasets were not created to be merged in the first place and/or individuals were not aware that merging was possible. So, what can we as researchers learn from this, other than concerns and further complication of our work? First, licenses and usage policies are a service to users of the data. Second, awareness of what constitutes sensitive data is a valuable skill to have on a team. That being said, such in-depth knowledge is rather easy to organize in centralized fashion: most universities and larger corporations will have an officer to run these things by.

6.4.4 Data Publications

Yet, there is more to managing your data’s exposure than just making sure everything is encrypted and locked up. Publication of data makes your results reproducible and improves trust in said results. As a consequence, there is a notable crescendo in the call for reproducible research. While reproducible research is great, I would like to raise awareness that essentially all solutions created and advertised to improve reproducibility implicitly assume the researcher deals with datasets obtained through a study. In other words, it is implied that your work is not about monitoring an ongoing process.

Data Archives

Research repositories like Zenodo that allow archiving data follow a snapshot thinking: a catalog entry refers to a particular version of a research paper, report, software or dataset. Whenever there is an update, a new version is added to a catalog entry. Adding datasets or software publications to catalogs like Zenodo does not only improve reproducibility, it also helps data workers get credit for their contribution. Fortunately, feeding research repositories is a task that is easy to automate thanks to great integration, APIs and community work.

Open Data

Open data archives are a special form of repository. The term open data refers to publicly available data that are available free of license costs and in machine-readable fashion.

“Open data and content can be freely used, modified, and shared by anyone for any purpose” – opendefinition.org

Because of approaches such as the Swiss government’s “open by default”, open government data (OGD) has become a common form of open data. The basic idea that data generated from publicly funded processes should be publicly available whenever no privacy rights are violated has been the motor for many OGD projects out of public administration. From small local governments to international organizations like the World Bank or the OECD, open data have become a valuable and growing source for researchers. Open data initiatives of your country, as well as those of major international organizations, will help you to create interoperable datasets. In addition, open data organizations provide you with a publication channel for your data or with a data catalog that stores the data description. Initiatives like SDMX (Statistical Data and Metadata eXchange) aim to improve the exchange of data and data descriptions. Their XML-based format has become an international standard, which led to the implementation of SDMX read and write routines in statistical software. Whether you think about the conditions of your own data publications or about a source for your own research project, make sure to consider open data. The openwashdata project I have been contributing to might be a hands-on inspiration to get a more concrete understanding of what open data is actually about. Among other things, the openwashdata project collects datasets from publications and re-publishes them in machine-readable format alongside their meta information. The data manipulation to reach this end is documented in reproducible fashion. The final results, R data packages, are published in freely available git repositories in the project’s GitHub organization.

“openwashdata is an active global community that applies FAIR principles to data generated in the greater water, sanitation, and hygiene (WASH) sector” – openwashdata.org


  1. Note that commas are not always necessarily the separator in .csv files. Because of the use of commas as decimal delimiters in some regions, columns are also often separated by semicolons to avoid conflicts.↩︎

  2. SDMX.org about SDMX: SDMX, which stands for Statistical Data and Metadata eXchange is an international initiative that aims at standardizing and modernizing (“industrializing”) the mechanisms and processes for the exchange of statistical data and metadata among international organizations and their member countries. SDMX is sponsored by seven international organizations including the Bank for International Settlements (BIS), the European Central Bank (ECB), Eurostat (Statistical Office of the European Union), the International Monetary Fund (IMF), the Organisation for Economic Cooperation and Development (OECD), the United Nations Statistical Division (UNSD), and the World Bank↩︎

  3. see also↩︎