University of Wisconsin - Madison
2024-12-05
It’s amazing to look back on the development of open-source tools for statistical computing and data science.
It wasn’t always clear that this path would be successful, but it has been — more than we could ever have imagined.
I will primarily discuss the development of S and R in the 1980’s and 1990’s, trying to put it in context.
Wikipedia links are given throughout these slides. It’s generally a good source for background on technical topics.
Created custom research programs in 1970’s, early experience with S in 1980’s, then R in 1990’s and beyond
Member of R-Core from 1997 until 2023
Know Base-R pretty well, not a tidyverse expert
Some experience with Python in the 1990’s and early 2000’s for website development
Predominantly working in Julia for the last decade
Most computing was on “mainframe” computers that occupied entire rooms.
A university might have a single computer, often an IBM System/360 (360 → “all round” computer).
Permanent data storage was limited to 9-track tape, only allowing sequential processing of records.
Biggest research use at universities was often batch-oriented statistical software such as BMDP (1965), SPSS (1968), and SAS (1972).
Many dramatic changes in computing environments in 70’s and 80’s came from AT&T Bell Labs
The Unix philosophy was centered around “software tools” as described in books by Brian Kernighan and his co-authors.
The approach emphasized functions and pipelines or composability instead of subroutines.
The original versions of S (aka “Old S”) followed this philosophy, using Ratfor and m4 to glue together algorithms and self-describing data objects.
In the mid-80’s John Chambers created a new language, QPE, with representations for both code and data. Later this became the “new” S language.
The “Statistical Models in S” project aimed for a consistent approach to many different types of statistical models, using classes (stored as attributes) and S3 methods (see the sketch below).
Much of the work was done in the late 80’s but the book appeared in 1991.
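A minimal sketch in present-day R (not the original S code) of the interface that project established: model-fitting functions take a formula and a data frame and return classed objects on which generic functions dispatch.

```r
## Fit a linear model from a formula and a data frame -- the interface
## introduced in "Statistical Models in S" and still used in R today.
fm <- lm(mpg ~ wt + hp, data = mtcars)

class(fm)    # "lm": the class is stored as an attribute of the fitted object
coef(fm)     # generic functions (coef, summary, predict, ...) dispatch on it,
summary(fm)  # so lm, glm, aov, etc. share a consistent interface
```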
System for data analysis and graphics (Bill Cleveland’s influence).
Self-describing data structures, including recursive structures (lists).
Code expressed as functions and function calls.
Interaction through a REPL (read-eval-print-loop) or inclusion of program files.
Code was a first-class object - functions could generate expressions or other functions.
Complexity (e.g. S3 classes) was often achieved through attributes, as sketched below.
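A short illustration in present-day R (the names are chosen for the example) of self-describing data, attributes, and functions as first-class objects.

```r
## Self-describing data: objects carry their structure with them
x <- c(a = 1, b = 2, c = 3)
attributes(x)          # $names -- metadata travels with the vector

## Complexity via attributes: an S3 class is just another attribute
f <- factor(c("low", "high", "low"))
attributes(f)          # $levels and $class

## Code as a first-class object: functions can create and return functions
make_power <- function(p) function(x) x^p
square <- make_power(2)
square(4)              # 16
```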
SAS, SPSS, etc. worked as filters on external, row-oriented tables in files.
S was based on vectors (including factors or ordered factors); a data.frame was a named, ordered collection of vectors.
“Columnar” views of data tables, as opposed to row-oriented approaches (including CSV), allow more flexibility in data science (see the sketch below).
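A brief sketch (illustrative data) of the columnar view: a data.frame behaves as a named, ordered list of equal-length column vectors.

```r
df <- data.frame(id    = 1:3,
                 group = factor(c("a", "b", "a")),
                 value = c(1.5, 2.0, 2.5))

is.list(df)        # TRUE: columns, not rows, are the primary unit
df$value           # extracting a whole column is a simple list access
sapply(df, class)  # per-column types (and factor levels) are immediately available
```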
Many analysts, betting against technological progress, felt that S could not succeed because it couldn’t handle “big data”.
Because AT&T was a regulated monopoly (long-distance telephone lines), they could not also sell software.
They could share software for research purposes. A small number of statistics depts were designated “beta test sites” for S.
A couple of times a year we would receive a 9-track tape with the software, a couple pages of instructions, and their best wishes.
Porting S to Windows, or to the many different versions of Unix on workstations from DEC, HP, IBM, Sun, etc. was difficult, time-consuming and required skilled personnel.
S-PLUS (1988) was created and marketed by a spin-off of the U. of Washington Stats Dept., initially called StatSci.
They licensed the base S code from Bell Labs and provided their own environment and packages.
Data graphics was a strong selling point.
In the late ’70s and through the ’80s, early microcomputers (Apple II, IBM PC (1981), Apple Macintosh) were introduced but did not have the power to compile and run statistical software.
By 1985 their potential was sufficient to cause SAS Institute to re-write their software in C.
The FSF (1985) was formed to promote free software, especially the emacs editor and the GNU system, using licenses like the GPL to prevent commercial capture.
The Linux (1991) kernel in conjunction with GNU software provided a functioning Open-Source computer system, often used for web servers on the newly developed World Wide Web in the L.A.M.P. (Linux, Apache, MySQL, PHP/Perl/Python) bundle.
In combination with version control systems, such servers allowed for distributed development and dissemination of software.
Especially at universities, setting up a web server was not horribly difficult. Some maintenance was required but not an overwhelming load.
The first copy of software was expensive (in people time) to produce. Subsequent copies were essentially free.
Support could be “crowd sourced” through email lists, wikis, etc.
Ross Ihaka and Robert Gentleman created a language that was “not unlike S” at U. of Auckland, initially released in 1993.
The design of the language itself followed that of S; internally it was more influenced by Scheme.
One important internal difference relative to S was in memory allocation and garbage collection.
Martin Mächler made many early contributions and in 1995 convinced “R & R” to release the source code under the GNU General Public License (GPL).
Other statisticians, especially those already using S, began to contribute.
Mailing lists, CRAN (source code and package repository), and R-Core were all created in 1997.
R version 1.0.0 was released on Feb. 29, 2000 - a major milestone was the implementation of S4 classes and methods with multiple dispatch (sketch below).
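A small illustration (not taken from the 1.0.0 sources; the class and generic names are invented) of S4 multiple dispatch, where the method chosen depends on the classes of both arguments.

```r
library(methods)  # attached by default in interactive R sessions

setClass("Celsius",    slots = c(deg = "numeric"))
setClass("Fahrenheit", slots = c(deg = "numeric"))

setGeneric("combine", function(a, b) standardGeneric("combine"))

setMethod("combine", signature("Celsius", "Celsius"),
          function(a, b) new("Celsius", deg = a@deg + b@deg))
setMethod("combine", signature("Celsius", "Fahrenheit"),
          function(a, b) new("Celsius", deg = a@deg + (b@deg - 32) * 5 / 9))

## Dispatch considers both argument classes
combine(new("Celsius", deg = 20), new("Fahrenheit", deg = 68))  # Celsius, deg = 40
```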
The goal of the S language was to allow users to progress through stages of use, from interactive analysis to writing functions and, eventually, extending the system.
CRAN, and other repositories like Bioconductor, became clearing houses of research and methodology and examples. Quality control and consistency of results are important.
“Literate programming” tools (e.g. Sweave, knitr, Quarto) provide reproducible report generation (see the sketch below).
Jupyter notebooks allow interactive evaluation in documents.
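A minimal sketch of a Quarto (.qmd) source mixing prose with an executable R chunk; the document contents are purely illustrative. Rendering it re-runs the code, so the report stays in sync with the data.

````markdown
---
title: "A reproducible report"
format: html
---

The estimated slope of mpg on wt in `mtcars` is
`r round(coef(lm(mpg ~ wt, data = mtcars))[2], 2)`.

```{r}
#| echo: true
plot(mpg ~ wt, data = mtcars)   # figure regenerated on every render
```
````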
Most commercial software does not encourage extension by users.
Eric Raymond’s book The Cathedral and the Bazaar describes the differences in approach of commercial and open-source software.
Commercial statistical software emphasized GUIs, which are fine to start with but ultimately restrictive.
Posit.co, formerly RStudio.com, is a public benefit corporation that independently developed:
the RStudio IDE and the multi-language Positron data science IDE
the tidyverse family of R packages for data science
To many users Posit.co is more central to R than is R-Core.
It is a separate entity and a commercial enterprise but complementary to and supportive of the R Project.
Julia has a smaller, but dedicated, following.
Each language is supported by thousands of packages (>500,000 on pypi.org, >20,000 on CRAN, >10,000 in the Julia General registry)
Even Microsoft supports Open-Source through github.com, VSCode, and other efforts.
The combinations of languages and packages provide incredible opportunities, but also a great deal to learn and remember.
These languages and packages can be used:
in a REPL (read-eval-print-loop)
through scripts or notebooks (e.g. Jupyter, whose name is built from “Julia, Python and R”)
in document-creation systems like Quarto
across many platforms from laptops to cloud servers to supercomputers
in conjunction with version control systems like git and internet repositories like github or gitlab
In the 1990’s open-source software was still considered kind of “iffy”. Our efforts on R were dismissed as “freeware” or as the “student edition” of S-PLUS.
Expensive commercial systems used proprietary formats in batch-oriented scripts or restrictive GUIs. (Minitab and Matlab were exceptions).
Eric Raymond contrasted these approaches in The Cathedral and the Bazaar.
It is wonderful to have several different languages and systems with which to do data science.
Teams may need to use multiple languages.
In any case, it is good to have exposure to more than one way to do an analysis.
But few people have the time or capacity to learn to use all these languages effectively.
In some ways it is a matter of balancing time to learn tools versus time spent using tools.
In a polyglot data science world it becomes important to allow different team members to collaborate across languages.
Arrow is an explicitly defined, column-oriented format for in-memory data tables or external files.
Allows for missing data and flags columns that contain none.
Allows for Dictionary Encoding, i.e. factor-like structures.
The memory layout is similar to the in-memory vector representations in Pandas, Polars, R, and Julia.
Files can be compressed or memory-mapped (see the sketch below).
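A short sketch using the arrow R package (assumed to be installed); the data and file name are illustrative.

```r
library(arrow)

df <- data.frame(id    = 1:4,
                 group = factor(c("a", "b", "a", NA)),   # missing value is preserved
                 value = c(1.5, 2.0, NA, 3.5))

write_feather(df, "example.arrow")            # Arrow IPC ("Feather V2") file

tbl <- read_feather("example.arrow", as_data_frame = FALSE)
tbl$schema                                    # column types; the factor is dictionary-encoded
as.data.frame(tbl)                            # back to an ordinary data.frame
```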
https://dmbates.quarto.pub/RGovys.pdf