Cross-language tools for statistics and data science

Douglas Bates

University of Wisconsin - Madison

Overview

It was the best of times, it was …

  • Good news: Availability of several powerful, well-supported languages and environments for data science.

  • Bad news: Need to learn (and remember) slightly different function names, syntax, and support tools.

  • Tools that apply to multiple languages can help reduce cross-talk and cognitive load.

Will describe three such tools

  • Arrow binary, column-oriented storage format for data tables - accessible from many languages

  • Quarto document preparation system (supported by Posit, the newly renamed company formerly known as RStudio).

  • VS Code editor and extensions

My background

  • early experience with S in the 1980s, then R in the 1990s and beyond
  • member of R-Core since 1997 (no longer active in development of R)
  • know Base-R pretty well, not a tidyverse expert
  • some experience with Python in the 1990s and early 2000s for website development
  • predominantly working in Julia for the last decade

Demo of Arrow and Quarto

  • three brief write-ups at dmbates.quarto.pub, reading and saving the collisions data.
  • shout-out to Posit.co for providing quarto.pub, which is incredibly easy to use

Powerful open-source tools

  • R and Python are very widely used for data science.
  • Julia has a smaller, but dedicated, following.
  • In the next session each of these languages will be demonstrated by an expert.
  • Each language is supported by thousands of packages.
  • The combinations of languages and packages provide incredible opportunities, but also a great deal to learn and remember.

Each language can be used

  • in a REPL (read-eval-print-loop)
  • through scripts or notebooks (e.g. Jupyter, whose name is built from “Julia, Python and R”)
  • in document-creation systems like Quarto
  • across many platforms from laptops to cloud servers to supercomputers
  • in conjunction with version control systems like git and internet repositories like GitHub or GitLab

Open Source and a REPL were key

  • In the 1990’s, R (based on the earlier language, S) pioneered Open Source data analysis
    • Open Source was still considered kind-of “iffy”. Our efforts were dismissed as “freeware” or as the “student edition” of S-PLUS.
    • Expensive commercial systems used proprietary formats in batch-oriented scripts or restrictive GUIs. (Minitab and Matlab were exceptions.)
    • Eric Raymond contrasted these approaches in The Cathedral and the Bazaar.

Bazaar - accessible but confusing

  • It is wonderful to have several different languages and systems with which to do data science.
  • Teams may need to use multiple languages.
  • In any case, it is good to have exposure to more than one way to do an analysis.
  • But few people have the time or capacity to learn to use all these languages effectively.
  • In some ways it is a matter of balancing time to learn tools versus time spent using tools.

Re-express the data table(s)

  • We like to think that model-building and visualization techniques are our big “value-added”.
  • In the real world a large part of our time is spent wrangling the data into a usable form.
  • Often the data come to us in a human-readable form, like CSV (e.g. nyc_mv_collisions_202201.csv). Reading and transforming/cleaning such files is expensive, tedious, and error-prone.
  • Once you have a satisfactory version, save it in a binary form with associated metadata.

The Arrow format

  • An apache.org project with implementations in many languages.
  • Most ‘high-level’ language implementations are based on the C++ library (Julia and Rust are exceptions).
  • Python/Pandas developers were early advocates
  • Column-oriented format for flat and hierarchical data
  • The C++ library provides analytics capability in addition to data manipulation

As we have seen

  • The Arrow format is promising but not yet foolproof.

From data to report

  • Most data science activities produce some kind of report or dashboard or website or book or …
  • R users often use R Markdown or knitr documents, or Shiny interactive pages, built with Posit (formerly RStudio) tools.
  • Python users may be more familiar with Jupyter notebooks and the jupyterlab environment.
  • Both systems allow for “reproducible research” where output, tables and figures are created from embedded source code.

Enter Quarto

  • Quarto is a reimplementation and generalization of the R Markdown/knitr approach to documents
  • Its development is supported by Posit (formerly RStudio).
  • Open Source (https://github.com/quarto-dev/quarto-cli/), it builds on pandoc
  • Code execution can be through knitr for R, or via conversion to a Jupyter notebook for Python and Julia
  • Equations, citations, cross-refs, callouts, and more
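To make this concrete, here is a sketch of a minimal Quarto source file (a `.qmd` file); the title and chunk contents are invented for illustration, and the CSV name is the one mentioned earlier. Running `quarto render` on it produces an HTML page whose output is computed from the embedded R chunk.

````markdown
---
title: "NYC collisions: a minimal example"
format: html
---

Reading the collisions data and counting the records.

```{r}
collisions <- read.csv("nyc_mv_collisions_202201.csv")
nrow(collisions)
```
````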

Check the documentation

My experience

  • I am not involved in Quarto development; I am simply a satisfied and grateful user.

  • Our in-development book Embrace Uncertainty describes mixed-effects models and a Julia package for fitting them.

  • Our Julia for Data Science workshop used a website built with Quarto.

Editing source files

VS Code

  • A Microsoft product with community support
  • Specialized by language-server and syntax implementations for many systems (e.g. Quarto)
  • Very good git support (Microsoft owns GitHub)
  • Very good ssh support (a priority for Microsoft)
  • Editing, running shells, etc. on a remote computer can be as easy as on your local machine.

Conclusions

  • We have many wonderful tools for data science available to us.
  • Time to learn such tools is at a premium.
  • It is good to be aware of the existence of tools, even if you can’t take time to learn them right now.
  • I have offered three such tools to consider.