Some history of open-source software and data storage for data science

Douglas Bates

University of Wisconsin - Madison

2024-12-05

“What a long, strange trip it’s been.” (The Grateful Dead)

Open-source and data analysis

  • It’s amazing to look back on the development of open-source tools for statistical computing and data science.

  • It wasn’t always clear that this path would be successful, but it has been — more than we could ever have imagined.

  • I will primarily discuss the development of S and R in the 1980’s and 1990’s, trying to put it in context.

  • Wikipedia links are given throughout these slides. It’s generally a good source for background on technical topics.

My background

  • Created custom research programs in 1970’s, early experience with S in 1980’s, then R in 1990’s and beyond

  • Member of R-Core from 1997 until 2023

  • Know Base-R pretty well, not a tidyverse expert

  • Some experience with Python in 1990’s, early 2000’s for web-site development

  • Predominantly working in Julia for the last decade

Early days

Hardware of the 1960’s and 1970’s

  • Most computing was on “mainframe” computers that occupied entire rooms.

  • A university might have a single computer, often an IBM System/360 (the “360” evoked the degrees of a circle: an all-around computer).

  • Permanent data storage was limited to 9-track tape, only allowing sequential processing of records.

  • Biggest research use at universities was often batch-oriented statistical software such as BMDP (1965), SPSS (1968), and SAS (1972).

Hardware came with an operating system and compilers.

  • Cobol, for commercial applications, and Fortran, for scientific computing, dated from the 1950’s

  • IBM pushed PL/I very hard, but not successfully

    • initial versions of SAS were written in PL/I, Fortran and (IBM) assembler

    • in 1985 SAS was re-written in C so it could be ported to microcomputers

Bell Labs, Unix, and S

The Unix Operating System

  • Many dramatic changes in computing environments in the ’70s and ’80s came from AT&T Bell Labs

    • The Unix (1971, 1973) operating system for minicomputers (DEC PDP-11) and later hardware (DEC VAX-11, workstations)

    • C (1972) and C++ (1985) programming languages

    • S (1976, 1980) language for statistical computing and graphics

Software tools

  • The Unix philosophy was centered around “software tools” as described in books by Brian Kernighan and his co-authors.

  • The approach emphasized functions and pipelines (composability) rather than subroutines.

  • The original versions of S (aka “Old S”) followed this philosophy, using Ratfor and m4 to glue together algorithms and self-describing data objects.

Statistical modeling project

  • In the mid-80’s John Chambers created a new language, QPE, with representations for both code and data. Later this became the “new” S language.

  • The “Statistical Models in S” project aimed for a consistent approach to many different types of statistical models using classes and (S3) methods implemented as attributes.

  • Much of the work was done in the late 80’s but the book appeared in 1991.

  • System for data analysis and graphics (Bill Cleveland’s influence).

What set S apart?

  • Self-describing data structures, including recursive structures (lists).

  • Code expressed as functions and function calls.

  • Interaction through a REPL (read-eval-print-loop) or inclusion of program files.

  • Code was a first-class object - functions could generate expressions or other functions.

  • Complexity (e.g. S3 classes) was often achieved through attributes (see the sketch below).
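
A minimal sketch of these ideas in R, the open-source descendant of S; the names make_power and myclass are invented for illustration:

    x <- c(a = 1, b = 2, c = 3)    # a self-describing (named) numeric vector
    attributes(x)                  # the names are stored as an attribute

    make_power <- function(p) function(y) y^p   # functions are first-class:
    square <- make_power(2)                     # a function can return a function
    square(4)                                   # 16

    obj <- structure(list(value = 42), class = "myclass")  # an S3 class is just a "class" attribute
    print.myclass <- function(x, ...) cat("value:", x$value, "\n")
    print(obj)                     # dispatches to print.myclass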

Objects in S were in-memory

  • SAS, SPSS, etc. worked as filters on external row-oriented tables in files

  • S was based on vectors (including factors or ordered factors)

    • a data.frame was a named, ordered collection of vectors (see the sketch after this list)
  • “columnar” views of data tables, as opposed to row-oriented approaches (including CSV), allow more flexibility in data science.

  • Many analysts, betting against technological progress, felt that S could not succeed because it couldn’t handle “big data”.
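
A small illustration in R (the column names are invented): a data.frame is, underneath, a named list of equal-length column vectors, so whole columns can be operated on directly.

    df <- data.frame(id    = 1:3,
                     group = factor(c("a", "b", "a")),
                     score = c(2.5, 3.1, 4.0))
    is.list(df)         # TRUE: the columns are the elements of a list
    names(df)           # "id" "group" "score"
    sapply(df, class)   # each column is a self-describing vector
    mean(df$score)      # compute on an entire column at once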

S outside Bell Labs

  • Because AT&T was a regulated monopoly (long-distance telephone lines), they could not also sell software.

  • They could share software for research purposes. A small number of statistics depts were designated “beta test sites” for S.

  • A couple of times a year we would receive a 9-track tape with the software, a couple of pages of instructions, and their best wishes.

Commercial versions of S

  • Porting S to Windows, or to the many different versions of Unix on workstations from DEC, HP, IBM, Sun, etc. was difficult, time-consuming and required skilled personnel.

  • S-PLUS (1988) was created and marketed by a spin-off, initially called StatSci, of the U. of Washington Stats Dept.

  • They licensed the base S code from Bell Labs and provided their own environment and packages.

  • Data graphics was a strong selling point.

Other early technology developments

Personal computers

  • In the late ’70s and through the ’80s, early microcomputers (Apple II, IBM PC (1981), Apple Macintosh) were introduced but did not have the power to compile and run statistical software.

  • By 1985 their potential was sufficient to cause SAS Institute to re-write their software in C

Free (Open-Source) Software

  • The FSF (1985) was formed to promote free software, especially the emacs editor and the GNU system, using licenses like the GPL to prevent commercial capture.

  • The Linux (1991) kernel in conjunction with GNU software provided a functioning Open-Source computer system, often used for web servers on the rapidly expanding Internet in the L.A.M.P. (Linux, Apache, MySQL, PHP/Perl/Python) bundle.

  • In combination with version control systems, such servers allowed for distributed development and dissemination of software.

Peculiar economics of free software

  • Especially at universities, setting up a web server was not horribly difficult. Some maintenance was required but not an overwhelming load.

  • The first copy of software was expensive (in people time) to produce. Subsequent copies were essentially free.

  • Support could be “crowd sourced” through email lists, wikis, etc.

The R Project

Initial implementation

  • Ross Ihaka and Robert Gentleman created a language that was “not unlike S” at U. of Auckland, initially released in 1993.

  • The design of the language itself followed that of S; internally it was more influenced by Scheme.

  • One important internal difference relative to S was in memory allocation and garbage collection.

    • S used memory-mapping of data objects; R allocates objects in memory and reclaims them with a garbage collector

Building community

  • Martin Mächler made many early contributions and in 1995 convinced “R & R” to release the source code under the GNU General Public License (GPL).

  • Other statisticians, especially those already using S, began to contribute.

  • Mailing lists, CRAN (source code and package repository), and R-Core were all created in 1997.

  • R version 1.0.0 was released on Feb. 29, 2000; a major milestone was the implementation of S4 classes and methods (multiple dispatch).

Extensibility

  • The goal of the S language was to allow users to progress through stages of use (sketched after this list)

    • use REPL as sophisticated calculator
    • write scripts for repeated operations
    • replace repeated operations with function definitions
    • gather functions, data, and documentation into a local package
    • publish a package for use by others
  • CRAN, and other repositories like Bioconductor, became clearing houses for research, methodology, and examples. Quality control and consistency of results are important.

  • “literate programming” (e.g. Sweave, knitr, Quarto) provided reproducible report generation.

  • Jupyter notebooks allow interactive evaluation in documents.
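
An invented, minimal example of that progression in R: a calculation typed at the REPL, then captured as a reusable function that could later be documented and bundled into a package.

    (98.6 - 32) * 5 / 9                      # 1. REPL as calculator: one Fahrenheit value to Celsius

    f_to_c <- function(f) (f - 32) * 5 / 9   # 2. a repeated operation becomes a function
    f_to_c(c(32, 98.6, 212))                 #    0 37 100: vectorized over many values

    # 3. f_to_c, sample data, and documentation could then be gathered into a
    #    local package and eventually published, e.g. on CRAN.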

Contrast with commercial model

  • Most commercial software does not encourage extension by users.

  • Eric Raymond’s book The Cathedral and the Bazaar describes the differences in approach of commercial and open-source software.

  • Commercial statistical software emphasized GUIs, which are fine to start with but ultimately restrictive.

Commercial support

  • Posit.co, formerly RStudio.com, is a public benefit corporation that independently developed the RStudio IDE and widely used tools such as the tidyverse packages, Shiny, and Quarto.

  • To many users Posit.co is more central to R than is R-Core.

  • It is a separate entity and a commercial enterprise but complementary to and supportive of the R Project.

Where we are now

Powerful open-source tools

  • R and Python are very widely used for data science.

  • Julia has a smaller, but dedicated, following.

  • Each language is supported by thousands of packages (>500,000 on pypi.org, >20,000 on CRAN, >10,000 in the Julia General registry)

  • Even Microsoft supports Open-Source through github.com, VSCode, and other efforts.

  • The combinations of languages and packages provide incredible opportunities, but there is a great deal to learn and remember.

Each language can be used

  • in a REPL (read-eval-print-loop)

  • through scripts or notebooks (e.g. Jupyter, whose name is built from “Julia, Python and R”)

  • in document-creation systems like Quarto

  • across many platforms from laptops to cloud servers to supercomputers

  • in conjunction with version control systems like git and internet repositories like github or gitlab

Open-source wasn’t a sure thing

  • In the 1990’s open-source was still considered kind-of “iffy”. Our efforts on R were dismissed as “freeware” or as the “student edition” of S-PLUS.

  • Expensive commercial systems used proprietary formats in batch-oriented scripts or restrictive GUIs. (Minitab and Matlab were exceptions.)

  • Eric Raymond contrasted these approaches in The Cathedral and the Bazaar.

Bazaar - accessible but confusing

  • It is wonderful to have several different languages and systems with which to do data science.

  • Teams may need to use multiple languages.

  • In any case, it is good to have exposure to more than one way to do an analysis.

  • But few people have the time or capacity to learn to use all these languages effectively.

  • In some ways it is a matter of balancing time to learn tools versus time spent using tools.

Reliable data interchange

  • In a polyglot data science world it becomes important to allow different team members to collaborate across languages.

  • Arrow is an explicitly defined, column-oriented format for in-memory data tables or external files (a small example in R follows this list).

  • Allows for missing data and flags columns that contain none.

  • Allows for Dictionary Encoding, i.e. factor-like structures.

  • Memory image is similar to in-memory vector representations in Pandas, Polars, R, and Julia.

  • Files can be compressed or memory-mapped.
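
A small sketch using the arrow R package (assuming a reasonably recent version; the file name and data are invented): a data frame with a missing value and a factor column is written to an Arrow/Feather file and read back. The same file could then be opened from Python (pyarrow/pandas) or Julia (Arrow.jl).

    library(arrow)
    df <- data.frame(id    = 1:4,
                     group = factor(c("a", "b", NA, "a")),  # a missing value is allowed
                     score = c(1.5, 2.0, 2.5, 3.0))
    # write a columnar Arrow IPC (Feather V2) file; the factor is typically
    # stored as a dictionary-encoded column
    write_feather(df, "scores.arrow")
    tbl <- read_feather("scores.arrow")  # read it back as a data frame
    str(tbl)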