Some history of open-source software and data storage for data science

Douglas Bates

University of Wisconsin - Madison

2024-12-05

“What a long, strange trip it’s been.” (The Grateful Dead)

Open-source and data analysis

  • It’s amazing to look back on the development of open-source tools for statistical computing and data science.

  • It wasn’t always clear that this path would be successful, but it has been — more than we could ever have imagined.

  • I will primarily discuss the development of S and R in the 1980’s and 1990’s, trying to put it in context.

  • Wikipedia links are given throughout these slides. It’s generally a good source for background on technical topics.

My background

  • Created custom research programs in 1970’s, early experience with S in 1980’s, then R in 1990’s and beyond

  • Member of R-Core from 1997 until 2023

  • Know Base-R pretty well, not a tidyverse expert

  • Some experience with Python in 1990’s, early 2000’s for web-site development

  • Predominantly working in Julia for the last decade

Early days

Hardware of the 1960’s and 1970’s

  • Most computing was on “mainframe” computers that occupied entire rooms.

  • A university might have a single computer, often an IBM System/360 (the “360” evoked the degrees of a circle: an all-around computer).

  • Permanent data storage was limited to 9-track tape, only allowing sequential processing of records.

  • Biggest research use at universities was often batch-oriented statistical software such as BMDP (1965), SPSS (1968), and SAS (1972).

Hardware came with an operating system and compilers.

  • Cobol, for commercial applications, and Fortran, for scientific computing, dated from the 1950’s

  • IBM pushed PL/I very hard, but not successfully

    • initial versions of SAS were written in PL/I, Fortran and (IBM) assembler

    • in 1985 SAS was re-written in C so it could be ported to microcomputers

Bell Labs, Unix, and S

The Unix Operating System

  • Many dramatic changes in computing environments in the ’70s and ’80s came from AT&T Bell Labs

    • The Unix (1971, 1973) operating system for minicomputers (DEC PDP-11) and later hardware (DEC VAX-11, workstations)

    • C (1972) and C++ (1985) programming languages

    • S (1976, 1980) language for statistical computing and graphics

Software tools

  • The Unix philosophy was centered around “software tools” as described in books by Brian Kernighan and his co-authors.

  • The approach emphasized functions and pipelines (composability) rather than subroutines.

  • The original versions of S (aka “Old S”) followed this philosophy, using Ratfor and m4 to glue together algorithms and self-describing data objects.

Statistical modeling project

  • In the mid-80’s John Chambers created a new language, QPE, with representations for both code and data. Later this became the “new” S language.

  • The “Statistical Models in S” project aimed for a consistent approach to many different types of statistical models using classes and (S3) methods implemented as attributes.

  • Much of the work was done in the late 80’s but the book appeared in 1991.

  • System for data analysis and graphics (Bill Cleveland’s influence).

What set S apart?

  • Self-describing data structures, including recursive structures (lists).

  • Code expressed as functions and function calls.

  • Interaction through a REPL (read-eval-print-loop) or inclusion of program files.

  • Code was a first-class object - functions could generate expressions or other functions.

  • Complexity (e.g. S3 classes) was often achieved through attributes (see the sketch below).
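
A minimal sketch of these ideas in R, the open-source descendant of S; the names make_power and myclass are invented for illustration:

    x <- c(a = 1, b = 2, c = 3)    # a self-describing (named) numeric vector
    attributes(x)                  # the names are stored as an attribute

    make_power <- function(p) function(y) y^p   # functions are first-class:
    square <- make_power(2)                     # a function can return a function
    square(4)                                   # 16

    obj <- structure(list(value = 42), class = "myclass")  # an S3 class is just a "class" attribute
    print.myclass <- function(x, ...) cat("value:", x$value, "\n")
    print(obj)                     # dispatches to print.myclass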

Objects in S were in-memory

  • SAS, SPSS, etc. worked as filters on external row-oriented tables in files

  • S was based on vectors (including factors or ordered factors)

    • a data.frame was a named, ordered collection of vectors (see the sketch after this list)
  • “columnar” views of data tables, as opposed to row-oriented approaches (including CSV), allow more flexibility in data science.

  • Many analysts, betting against technological progress, felt that S could not succeed because it couldn’t handle “big data”.
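
A small illustration in R (the column names are invented): a data.frame is, underneath, a named list of equal-length column vectors, so whole columns can be operated on directly.

    df <- data.frame(id    = 1:3,
                     group = factor(c("a", "b", "a")),
                     score = c(2.5, 3.1, 4.0))
    is.list(df)         # TRUE: the columns are the elements of a list
    names(df)           # "id" "group" "score"
    sapply(df, class)   # each column is a self-describing vector
    mean(df$score)      # compute on an entire column at once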

S outside Bell Labs

  • Because AT&T was a regulated monopoly (long-distance telephone lines), they could not also sell software.

  • They could share software for research purposes. A small number of statistics depts were designated “beta test sites” for S.

  • A couple of times a year we would receive a 9-track tape with the software, a couple of pages of instructions, and their best wishes.

Commercial versions of S

  • Porting S to Windows, or to the many different versions of Unix on workstations from DEC, HP, IBM, Sun, etc. was difficult, time-consuming and required skilled personnel.

  • S-PLUS (1988) was created and marketed by a spin-off, initially called StatSci, of the U. of Washington Stats Dept.

  • They licensed the base S code from Bell Labs and provided their own environment and packages.

  • Data graphics was a strong selling point.

Other early technology developments

Personal computers

  • In the late ’70s and through the ’80s, early microcomputers (Apple II, IBM PC (1981), Apple Macintosh) were introduced but did not have the power to compile and run statistical software.

  • By 1985 their potential was sufficient to cause SAS Institute to re-write their software in C

Free (Open-Source) Software

  • The FSF (1985) was formed to promote free software, especially the emacs editor and the GNU system, using licenses like the GPL to prevent commercial capture.

  • The Linux (1991) kernel in conjunction with GNU software provided a functioning Open-Source computer system, often used for web servers on the rapidly expanding Internet in the L.A.M.P. (Linux, Apache, MySQL, PHP/Perl/Python) bundle.

  • In combination with version control systems, such servers allowed for distributed development and dissemination of software.

Peculiar economics of free software

  • Especially at universities, setting up a web server was not horribly difficult. Some maintenance was required but not an overwhelming load.

  • The first copy of software was expensive (in people time) to produce. Subsequent copies were essentially free.

  • Support could be “crowd sourced” through email lists, wikis, etc.

The R Project

Initial implementation

  • Ross Ihaka and Robert Gentleman created a language that was “not unlike S” at U. of Auckland, initially released in 1993.

  • The design of the language itself followed that of S; internally it was more influenced by Scheme.

  • One important internal difference relative to S was in memory allocation and garbage collection.

    • S used memory-mapping of data objects; R allocates objects in memory and reclaims them with a garbage collector

Building community

  • Martin Mächler made many early contributions and in 1995 convinced “R & R” to release the source code under the GNU General Public License (GPL).

  • Other statisticians, especially those already using S, began to contribute.

  • Mailing lists, CRAN (source code and package repository), and R-Core were all created in 1997.

  • R version 1.0.0 was released on Feb. 29, 2000; a major milestone was the implementation of S4 classes and methods (multiple dispatch).

Extensibility

  • The goal of the S language was to allow users to progress through stages of use (sketched after this list)

    • use REPL as sophisticated calculator
    • write scripts for repeated operations
    • replace repeated operations with function definitions
    • gather functions, data, and documentation into a local package
    • publish a package for use by others
  • CRAN, and other repositories like Bioconductor, became clearing houses for research, methodology, and examples. Quality control and consistency of results are important.

  • “literate programming” (e.g. Sweave, knitr, Quarto) provided reproducible report generation.

  • Jupyter notebooks allow interactive evaluation in documents.
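
An invented, minimal example of that progression in R: a calculation typed at the REPL, then captured as a reusable function that could later be documented and bundled into a package.

    (98.6 - 32) * 5 / 9                      # 1. REPL as calculator: one Fahrenheit value to Celsius

    f_to_c <- function(f) (f - 32) * 5 / 9   # 2. a repeated operation becomes a function
    f_to_c(c(32, 98.6, 212))                 #    0 37 100: vectorized over many values

    # 3. f_to_c, sample data, and documentation could then be gathered into a
    #    local package and eventually published, e.g. on CRAN.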

Contrast with commercial model

  • Most commercial software does not encourage extension by users.

  • Eric Raymond’s book The Cathedral and the Bazaar describes the differences in approach of commercial and open-source software.

  • Commercial statistical software emphasized GUIs, which are fine to start with but ultimately restrictive.

Commercial support

  • Posit.co, formerly RStudio.com, is a public benefit corporation that independently developed the RStudio IDE and widely used tools such as the tidyverse packages, Shiny, and Quarto.

  • To many users Posit.co is more central to R than is R-Core.

  • It is a separate entity and a commercial enterprise but complementary to and supportive of the R Project.

Where we are now

Powerful open-source tools

  • R and Python are very widely used for data science.

  • Julia has a smaller, but dedicated, following.

  • Each language is supported by thousands of packages (>500,000 on pypi.org, >20,000 on CRAN, >10,000 in the Julia General registry)

  • Even Microsoft supports Open-Source through github.com, VSCode, and other efforts.

  • The combinations of languages and packages provide incredible opportunities, but there is a great deal to learn and remember.

Each language can be used

  • in a REPL (read-eval-print-loop)

  • through scripts or notebooks (e.g. Jupyter, whose name is built from “Julia, Python and R”)

  • in document-creation systems like Quarto

  • across many platforms from laptops to cloud servers to supercomputers

  • in conjunction with version control systems like git and internet repositories like github or gitlab

Open-source wasn’t a sure thing

  • In the 1990’s open-source was still considered kind-of “iffy”. Our efforts on R were dismissed as “freeware” or as the “student edition” of S-PLUS.

  • Expensive commercial systems used proprietary formats in batch-oriented scripts or restrictive GUIs. (Minitab and Matlab were exceptions.)

  • Eric Raymond contrasted these approaches in The Cathedral and the Bazaar.

Bazaar - accessible but confusing

  • It is wonderful to have several different languages and systems with which to do data science.

  • Teams may need to use multiple languages.

  • In any case, it is good to have exposure to more than one way to do an analysis.

  • But few people have the time or capacity to learn to use all these languages effectively.

  • In some ways it is a matter of balancing time to learn tools versus time spent using tools.

Reliable data interchange

  • In a polyglot data science world it becomes important to allow different team members to collaborate across languages.

  • Arrow is an explicitly defined, column-oriented format for in-memory data tables or external files (a small example in R follows this list).

  • Allows for missing data and flags columns that contain none.

  • Allows for Dictionary Encoding, i.e. factor-like structures.

  • Memory image is similar to in-memory vector representations in Pandas, Polars, R, and Julia.

  • Files can be compressed or memory-mapped.
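
A small sketch using the arrow R package (assuming a reasonably recent version; the file name and data are invented): a data frame with a missing value and a factor column is written to an Arrow/Feather file and read back. The same file could then be opened from Python (pyarrow/pandas) or Julia (Arrow.jl).

    library(arrow)
    df <- data.frame(id    = 1:4,
                     group = factor(c("a", "b", NA, "a")),  # a missing value is allowed
                     score = c(1.5, 2.0, 2.5, 3.0))
    # write a columnar Arrow IPC (Feather V2) file; the factor is typically
    # stored as a dictionary-encoded column
    write_feather(df, "scores.arrow")
    tbl <- read_feather("scores.arrow")  # read it back as a data frame
    str(tbl)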