I am an enthusiastic user of the Julia programming language. In my opinion, it provides “the best of both worlds” for data science in that it is a dynamically-typed language that can be used in a REPL, and the code can be quite performant because of Julia’s JIT (Just-In-Time) compiler. However, not all Julia code will end up being fast. Often, an impediment to performance is code in which the type of an object cannot be determined at compile time.
In this post I explore a common scenario in data science; receiving serialized data in JSON format and creating a table from this JSON. The deserialization of the JSON can result in vaguely-typed objects with resulting inefficient code due to poor type inference.
I will show a way to create avoid the vague typing using NamedTuples that can be collected in Tables using the TypedTables.jl package. These tables can then be stored as Arrow files which can later be read into whatever language is preferred by the analyst.
The Movie DB
There are many sources of information about movies, television shows, performers, etc. on the internet. I prefer to use themoviedb.org (also accessible as tmdb.org) because its data is unemcumbered. You can search on the web site without an account. If you establish a (free) account, you can contribute to the ratings and create collections of items of interest to you. If you go one step further and request an API Key, you can access their database via HTTP requests.
As a non-trivial, freely-available server with a public API it is an excellent test-case when learning programming of internet protocols. Also, it is fun to check out details on movies, etc.
In this posting I will show methods for downloading and processing some of their daily file export tables. Access to these tables doesn’t require an API Key.
These tables are used to associate names and id numbers of movies, television series, people etc. It is worthwhile having local copies to be able to do quick lookups of ids or names when you encounter them, without needing to go through formulating an HTTP request, parsing the response, untangling the JSON encoding, etc.
Attach packages
First, we attach the packages to be used to make HTTP requests, to parse JSON, and to work with various table formats.
Code
usingArrow # the arrow.apache.org column-oriented table formatusingCodecZlib # file compression formats on IO streamsusingDates # time/date formatsusingHTTP # HTTP communication for clients or serversusingJSON3 # read and write JSON with a slick interface for structsusingTypedTables # low-overhead row-oriented table wrapperusingURIs # construct or deconstruct URI strings
┌ Info: Precompiling Arrow [69666777-d1a9-59fb-9406-91d4454c9d45]
└ @ Base loading.jl:1664
The URIs for these downloads are constructed from the type of ids returned (e.g. keyword, movie, person) and the date when the file was created. For example, the URI for today’s table of keyword ids is
The URI generic can also be used to create a URI from its components. Creating a URI in this way has the advantage that the individual components are checked to ensure that they are in the proper form.
The path part of the URI is intentionally similar to file paths in a file system, and many of the generic functions for working with file paths in Julia are generalized to URI’s. In particular, there is a method for joinpath that takes a URI and one or more strings to append to the path.
In this example we can create the fixed part of the URI and generate the variable information on the trailing part programmatically. Because the date is encoded as part of the URI, we attach the Dates package from Julia’s standard library to be able to work with different date formats.
The HTTP package provides a download function to download a file. By default the file is downloaded to a temporary location that is cleared when the Julia session is over. We will instead download to a local file, which allows us to check if that file already exists and to avoid repeated downloads while developing this code.
We use the last part, called the basename, of the path in the uri as the filename.
The file could be decompressed, using whatever facilities the operating system provides, then read. An alternative is to open the file as an IOStream and stream it through a GZipDecompressorStream.
Finally, as is stated in the API documentation, the file, after decompression, isn’t a single JSON expression - as would be expected for a file with a .json file extension. Instead, each line is a JSON expression and all the lines evaluate to the same type of object. Fortunately, this approach is sufficiently common that JSON3.read has an optional argument jsonlines to flag this situation.
To see exactly the form, we examine the first line.
println(String(readline(seekstart(kwstrm))))
{"id":378,"name":"prison"}
Why seekstart?
As kwstrm has just been opened, it is redundant to call seekstart on it to reset the reading position to the beginning of the stream. However I often forget to do that on an existing stream and wonder why the contents don’t look as I expect them to, so I use this idiom.
We can see that the structure is very simple - an id number as an integer and a name as a string.
General table types in Julia
The Tables.jl package defines general concepts of row-oriented and column-oriented tables in Julia.
In a column-oriented table the columns are stored as vectors and the table consists of a named, ordered collection of vectors, which must all have the same length. For example, the DataFrames.jl package provides an implementation of mutable (i.e. capable of being modified), column-oriented, versatile tables.
The mutability refers to the ability to add, remove, replace, or rename columns. But mutability comes with a cost in complexity and size of the package. If we are willing to work with an immutable table we can use a NamedTuple of vectors (column-oriented) or a Vector{NamedTuple} (row-oriented). A NamedTuple is a strongly typed, ordered collection of named fields.
For example
nt = JSON3.read(readline(seekstart(kwstrm)), NamedTuple)
(id = 378, name = "prison")
typeof(nt)
NamedTuple{(:id, :name), Tuple{Int64, String}}
The concrete type of a NamedTuple includes the names (as Symbol’s), the order, and the types of the fields in the object.
A vector of a specific NamedTuple type is the archetypal row-oriented table in Julia.
The TypedTables.jl package provides a Table type that allows manipulation of row-oriented tables.
Table with 2 columns and 49079 rows:
id name
┌─────────────────────────
1 │ 378 prison
2 │ 240 underdog
3 │ 1787 helsinki, finland
4 │ 730 factory worker
5 │ 1361 salesclerk
6 │ 4529 diplomat
7 │ 1008 guerrilla warfare
8 │ 612 hotel
9 │ 613 new year's eve
10 │ 616 witch
11 │ 622 bet
12 │ 2700 sperm
13 │ 922 hotel room
14 │ 2231 drug dealer
15 │ 520 chicago, illinois
16 │ 3737 dying and death
17 │ 544 sailboat
18 │ 11195 empire
19 │ 4270 galaxy
20 │ 7376 princess
21 │ 10084 rescue
22 │ 11196 rebellion
23 │ 4932 farm
⋮ │ ⋮ ⋮
eltype(kwtbl)
NamedTuple{(:id, :name), Tuple{Int64, String}}
If we want to be more specific, say we want 32-bit integers instead of 64-bit integers for the id number, we provide a concrete type of NamedTuple instead of the abstract type.
49079-element Vector{NamedTuple{(:id, :name), Tuple{Int32, String}}}:
(id = 378, name = "prison")
(id = 240, name = "underdog")
(id = 1787, name = "helsinki, finland")
(id = 730, name = "factory worker")
(id = 1361, name = "salesclerk")
(id = 4529, name = "diplomat")
(id = 1008, name = "guerrilla warfare")
(id = 612, name = "hotel")
(id = 613, name = "new year's eve")
(id = 616, name = "witch")
(id = 622, name = "bet")
(id = 2700, name = "sperm")
(id = 922, name = "hotel room")
⋮
(id = 308547, name = "chamber orchestra")
(id = 308548, name = "narcissistic woman")
(id = 308549, name = "fart machine")
(id = 308550, name = "eurovision national competition")
(id = 308551, name = "hidden potential")
(id = 308552, name = "sumo wrestling")
(id = 308553, name = "ona")
(id = 308554, name = "rhythm game")
(id = 308557, name = "angel dust")
(id = 308558, name = "dusted")
(id = 308559, name = "vr film")
(id = 308561, name = "ex-god")
Save a table in Arrow format
Arrow is a column-oriented, tabular format that has been implemented in several languages, including the Arrow.jl package for Julia.
Arrow is also used as the internal storage format for several projects such as the DuckDB database and the Polars DataFrame library for Rust and Python.
Writing the table to an Arrow format file
To make it easier to search for an id in the table, we sort kwtbl32 by id while writing it to an Arrow file.
Such a file can now be read by packages in many different languages.
Of course, it can be read with the Julia Arrow.jl package
newtbl = Arrow.Table(kwarrow)
Arrow.Table with 49079 rows, 2 columns, and schema:
:id Int32
:name String
with metadata given by a Base.ImmutableDict{String, String} with 1 entry:
"URI" => "http://files.tmdb.org/p/exports/keyword_ids_12_09_2022.json.gz"
first(Table(newtbl), 10)
Table with 2 columns and 10 rows:
id name
┌──────────────────────
1 │ 30 individual
2 │ 65 holiday
3 │ 74 germany
4 │ 75 gunslinger
5 │ 83 saving the world
6 │ 90 paris, france
7 │ 100 slum
8 │ 107 barcelona, spain
9 │ 108 transvestism
10 │ 110 venice, italy
In Python we could use the pyarrow and polars packages as
$ pythonPython 3.10.8 |packaged by conda-forge |(main, Nov 22 2022, 08:23:14)[GCC 10.4.0] on linuxType"help", "copyright", "credits" or "license" for more information.>>> import pyarrow.feather as feather>>> import polars as pl>>> feather.read_table("keyword.arrow")pyarrow.Tableid: int32 not nullname: string not null----id: [[30,65,74,75,83,...,308144,308145,308146,308147,308150]]name: [["individual","holiday","germany","gunslinger","saving the world",...,"boxing match","assassination plot","lifelong friends","navy commander","planetary alignment"]]>>> feather.read_feather("keyword.arrow")# as a pandas DataFrameid name0 30 individual1 65 holiday2 74 germany3 75 gunslinger4 83 saving the world... ... ...48838 308144 boxing match48839 308145 assassination plot48840 308146 lifelong friends48841 308147 navy commander48842 308150 planetary alignment[48843 rows x 2 columns]>>> pl.from_arrow(feather.read_table('keyword.arrow'))shape:(48843, 2)┌────────┬─────────────────────┐│ id ┆ name ││--- ┆ --- ││ i32 ┆ str │╞════════╪═════════════════════╡│ 30 ┆ individual │├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤│ 65 ┆ holiday │├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤│ 74 ┆ germany │├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤│ 75 ┆ gunslinger │├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤│ ... ┆ ... │├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤│ 308145 ┆ assassination plot │├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤│ 308146 ┆ lifelong friends │├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤│ 308147 ┆ navy commander │├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤│ 308150 ┆ planetary alignment │└────────┴─────────────────────┘
Automating the process
When automating the process we avoid the downloading to a file and use the HTTP.jl package to send an HTTP.get request whose response body is piped to an IOStream that is uncompressed on the fly.
""" dlidtbl(; date::Date=today(), uridir::URI, templates::Dict)Download the tables of ids from `files.tmdb.org` and write them to Arrow filesThe name and schema of each table are given in `templates`,whose default value downloads all the ids files."""functiondlidtbl(; # no positional arguments, only named arguments with defaults date::Date=today(), uridir::URI =URI(scheme="https", host="files.tmdb.org", path="/p/exports"), templates::Dict{<:AbstractString,<:NamedTuple} =Dict("movie"=> (adult=false, id=Int32(0), original_title="", popularity=1.0f0, video=false),"tv_series"=> (id=Int32(0), original_name="", popularity=1.0f0),"person"=> (adult=false, id=Int32(0), name="", popularity=1.0f0),"collection"=> (id=Int32(0), name=""),"tv_network"=> (id=Int32(0), name=""),"keyword"=> (id=Int32(0), name=""),"production_company"=> (id=Int32(0), name=""), )) datestr =Dates.format(date, dateformat"mm_dd_yyyy") # format used in file namesfor nm inkeys(templates) uri =joinpath(uridir, string(nm, "_ids_", datestr, ".json.gz")) r = HTTP.get(uri; response_stream=IOBuffer())if r.status ≠200# response status of 200 indicates successthrow(ArgumentError("get request to \"$uri\" returned status=$(r.status)"))end tbl = JSON3.read(GzipDecompressorStream(seekstart(r.body)),Vector{typeof(templates[nm])}; jsonlines=true) Arrow.write(string(nm, ".arrow"),sort(tbl; by=getproperty(:id)); compress=:lz4, metadata=["URI"=>string(uri)], )endreturnnothingend