Cross-language tools for statistics and data science

Douglas Bates

University of Wisconsin - Madison

Overview

It was the best of times, it was …

  • Good news: Availability of several powerful, well-supported languages and environments for data science.

  • Bad news: Need to learn (and remember) slightly different function names, syntax, and support tools.

  • Tools that apply to multiple languages can help reduce cross-talk and cognitive load.

Will describe three such tools

  • Arrow binary, column-oriented storage format for data tables - accessible from many languages

  • Quarto document preparation system (supported by Posit, the newly renamed company formerly known as RStudio).

  • VS Code editor and extensions

My background

  • early experience with S in the 1980s, then R in the 1990s and beyond
  • member of R-Core since 1997 (no longer active in development of R)
  • know Base-R pretty well, not a tidyverse expert
  • some experience with Python in the 1990s and early 2000s for website development
  • predominantly working in Julia for the last decade

Demo of Arrow and Quarto

  • three brief write-ups at dmbates.quarto.pub, reading and saving the collisions data.
  • shout-out to Posit.co for providing quarto.pub, which is incredibly easy to use

Powerful open-source tools

  • R and Python are very widely used for data science.
  • Julia has a smaller, but dedicated, following.
  • In the next session each of these languages will be demonstrated by an expert.
  • Each language is supported by thousands of packages.
  • The combinations of languages and packages provide incredible opportunities, but also a great deal to learn and remember.

Each language can be used

  • in a REPL (read-eval-print-loop)
  • through scripts or notebooks (e.g. Jupyter, whose name is built from “Julia, Python and R”)
  • in document-creation systems like Quarto
  • across many platforms from laptops to cloud servers to supercomputers
  • in conjunction with version control systems like git and internet repositories like GitHub or GitLab

Open Source and a REPL were key

  • In the 1990’s, R (based on the earlier language, S) pioneered Open Source data analysis
    • Open Source was still considered kind-of “iffy”. Our efforts were dismissed as “freeware” or as the “student edition” of S-PLUS.
    • Expensive commercial systems used proprietary formats in batch-oriented scripts or restrictive GUIs. (Minitab and Matlab were exceptions.)
    • Eric Raymond contrasted these approaches in The Cathedral and the Bazaar.

Bazaar - accessible but confusing

  • It is wonderful to have several different languages and systems with which to do data science.
  • Teams may need to use multiple languages.
  • In any case, it is good to have exposure to more than one way to do an analysis.
  • But few people have the time or capacity to learn to use all these languages effectively.
  • In some ways it is a matter of balancing time to learn tools versus time spent using tools.

Re-express the data table(s)

  • We like to think that model-building and visualization techniques are our big “value-added”.
  • In the real world a large part of our time is spent wrangling the data into a usable form.
  • Often the data come to us in a human-readable form, like CSV (e.g. nyc_mv_collisions_202201.csv). Reading and transforming/cleaning such files is expensive, tedious, and error-prone.
  • Once you have a satisfactory version, save it in a binary form with associated metadata.

The Arrow format

  • An apache.org project with implementations in many languages.
  • Most ‘high-level’ language implementations are based on the C++ library (Julia and Rust are exceptions).
  • Python/Pandas developers were early advocates
  • Column-oriented format for flat and hierarchical data
  • The C++ library provides analytics capability in addition to data manipulation

As we have seen

  • The Arrow format is promising but not yet foolproof.

From data to report

  • Most data science activities produce some kind of report or dashboard or website or book or …
  • R users often use R Markdown or knitr documents, or Shiny interactive pages, built with Posit (formerly RStudio) tools.
  • Python users may be more familiar with Jupyter notebooks and the jupyterlab environment.
  • Both systems allow for “reproducible research” where output, tables and figures are created from embedded source code.

Enter Quarto

  • Quarto is a reimplementation and generalization of the R Markdown/knitr approach to documents
  • Its development is supported by Posit (formerly RStudio).
  • Open Source (https://github.com/quarto-dev/quarto-cli/), it builds on pandoc
  • Code execution can be through knitr for R, or via conversion to a Jupyter notebook for Python and Julia
  • Equations, citations, cross-refs, callouts, and more
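To make this concrete, here is a sketch of a minimal Quarto source file (a `.qmd` file); the title and chunk contents are invented for illustration, and the CSV name is the one mentioned earlier. Running `quarto render` on it produces an HTML page whose output is computed from the embedded R chunk.

````markdown
---
title: "NYC collisions: a minimal example"
format: html
---

Reading the collisions data and counting the records.

```{r}
collisions <- read.csv("nyc_mv_collisions_202201.csv")
nrow(collisions)
```
````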

Check the documentation

My experience

  • I am not involved in Quarto development; I am simply a satisfied and grateful user.

  • Our in-development book Embrace Uncertainty describes mixed-effects models and a Julia package for fitting them.

  • Our Julia for Data Science workshop used a website built with Quarto.

Editing source files

VS Code

  • A Microsoft product with community support
  • Specialized by language-server and syntax implementations for many systems (e.g. Quarto)
  • Very good git support (Microsoft owns GitHub)
  • Very good ssh support (a priority for Microsoft)
  • Editing, running shells, etc. on a remote computer can be as easy as on your local machine.

Conclusions

  • We have many wonderful tools for data science available to us.
  • Time to learn such tools is at a premium.
  • It is good to be aware of the existence of tools, even if you can’t take time to learn them right now.
  • I have offered three such tools to consider.