Cross-language tools for statistics and data science
Douglas Bates
University of Wisconsin - Madison
It was the best of times, it was …
Good news: Availability of several powerful, well-supported languages and environments for data science.
Bad news: Need to learn (and remember) slightly different function names, syntax, and support tools.
Tools that apply to multiple languages can help reduce cross-talk and cognitive load.
My background
- early experience with S in the 1980s, then R in the 1990s and beyond
- member of R-Core since 1997 (no longer active in development of R)
- know Base-R pretty well, not a tidyverse expert
- some experience with Python in the 1990s and early 2000s for web-site development
- predominantly working in Julia for the last decade
Demo of Arrow and Quarto
- three brief write-ups at dmbates.quarto.pub, reading and saving the collisions data
- shout-out to Posit.co for providing quarto.pub, which is incredibly easy to use
Each language can be used
- in a REPL (read-eval-print-loop)
- through scripts or notebooks (e.g. Jupyter, whose name is built from “Julia, Python and R”)
- in document-creation systems like Quarto
- across many platforms from laptops to cloud servers to supercomputers
- in conjunction with version control systems like git and internet repositories like GitHub or GitLab
Open Source and a REPL were key
- In the 1990s, R (based on the earlier language, S) pioneered Open Source data analysis.
- Open Source was still considered kind of “iffy”. Our efforts were dismissed as “freeware” or as the “student edition” of S-PLUS.
- Expensive commercial systems used proprietary formats in batch-oriented scripts or restrictive GUIs. (Minitab and Matlab were exceptions.)
- Eric Raymond contrasted these approaches in The Cathedral and the Bazaar.
Bazaar - accessible but confusing
- It is wonderful to have several different languages and systems with which to do data science.
- Teams may need to use multiple languages.
- In any case, it is good to have exposure to more than one way to do an analysis.
- But few people have the time or capacity to learn to use all these languages effectively.
- In some ways it is a matter of balancing time to learn tools versus time spent using tools.
Re-express the data table(s)
- We like to think that model-building and visualization techniques are our big “value-added”.
- In the real world a large part of our time is spent wrangling the data into a usable form.
- Often the data come to us in a human-readable form, like CSV (e.g. nyc_mv_collisions_202201.csv). Reading, transforming, and cleaning such data is expensive, tedious, and error-prone.
- Once you have a satisfactory version, save it in a binary form with associated metadata.
As we have seen
- The Arrow format is promising but not yet foolproof.
From data to report
- Most data science activities produce some kind of report or dashboard or website or book or …
- R users often use RMarkdown or knitr documents or Shiny interactive pages using Posit (formerly RStudio) tools.
- Python users may be more familiar with Jupyter notebooks and the jupyterlab environment.
- Both systems allow for “reproducible research” where output, tables and figures are created from embedded source code.
Enter Quarto
- Quarto is a reimplementation and generalization of the RMarkdown/knitr approach for documents.
- Its development is supported by Posit (formerly RStudio).
- Open Source (https://github.com/quarto-dev/quarto-cli/), it builds on pandoc.
- Code execution can be through knitr for R and conversion to Jupyter for Python and Julia.
- Equations, citations, cross-references, callouts, and more.
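As a concrete illustration, a minimal Quarto source file might look like this (the title and chunk contents here are invented for illustration; the `{python}` chunk is executed, via Jupyter, when the document is rendered):

````markdown
---
title: "Collisions summary"
format: html
---

Narrative text, with an executable chunk whose output is
embedded in the rendered HTML:

```{python}
injured = [1, 0, 2]
sum(injured)
```
````

Rendering with `quarto render` produces the report with code, output, and prose interleaved, which is the basis of the “reproducible research” workflow mentioned above.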
My experience
I am not involved in Quarto development; I am simply a satisfied and grateful user.
Our in-development book Embrace Uncertainty describes mixed-effects models and a Julia package for fitting them.
Our Julia for Data Science workshop used a website built with Quarto.
VS Code
- A Microsoft product with community support
- Specialized by “languageserver” and syntax implementations for many systems (e.g. Quarto)
- Very good git support (Microsoft owns GitHub)
- Very good ssh support (a priority for Microsoft)
- Editing, running shells, etc. on a remote computer can be as easy as on your local machine.
Conclusions
- We have many wonderful tools for data science available to us.
- Time to learn such tools is at a premium.
- It is good to be aware of the existence of tools, even if you can’t take time to learn them right now.
- I have offered three such tools to consider: Arrow, Quarto, and VS Code.