Arrow from JSON

code
Julia
HTTP
JSON
Arrow
Author

Douglas Bates

Published

December 4, 2022

Arrow from JSON in Julia

I am an enthusiastic user of the Julia programming language. In my opinion, it provides “the best of both worlds” for data science in that it is a dynamically-typed language that can be used in a REPL, and the code can be quite performant because of Julia’s JIT (Just-In-Time) compiler. However, not all Julia code will end up being fast. Often, an impediment to performance is code in which the type of an object cannot be determined at compile time.

In this post I explore a common scenario in data science; receiving serialized data in JSON format and creating a table from this JSON. The deserialization of the JSON can result in vaguely-typed objects with resulting inefficient code due to poor type inference.

I will show a way to create avoid the vague typing using NamedTuples that can be collected in Tables using the TypedTables.jl package. These tables can then be stored as Arrow files which can later be read into whatever language is preferred by the analyst.

The Movie DB

There are many sources of information about movies, television shows, performers, etc. on the internet. I prefer to use themoviedb.org (also accessible as tmdb.org) because its data is unemcumbered. You can search on the web site without an account. If you establish a (free) account, you can contribute to the ratings and create collections of items of interest to you. If you go one step further and request an API Key, you can access their database via HTTP requests.

As a non-trivial, freely-available server with a public API it is an excellent test-case when learning programming of internet protocols. Also, it is fun to check out details on movies, etc.

In this posting I will show methods for downloading and processing some of their daily file export tables. Access to these tables doesn’t require an API Key.

These tables are used to associate names and id numbers of movies, television series, people etc. It is worthwhile having local copies to be able to do quick lookups of ids or names when you encounter them, without needing to go through formulating an HTTP request, parsing the response, untangling the JSON encoding, etc.

Attach packages

First, we attach the packages to be used to make HTTP requests, to parse JSON, and to work with various table formats.

Code
using Arrow       # the arrow.apache.org column-oriented table format
using CodecZlib   # file compression formats on IO streams
using Dates       # time/date formats
using HTTP        # HTTP communication for clients or servers
using JSON3       # read and write JSON with a slick interface for structs
using TypedTables # low-overhead row-oriented table wrapper
using URIs        # construct or deconstruct URI strings
┌ Info: Precompiling Arrow [69666777-d1a9-59fb-9406-91d4454c9d45]
└ @ Base loading.jl:1664

Form of the URI

A Uniform Resource Locators (URL) is a specific type of Uniform Resource Identifier or URI. See the discussion in the linked Wikipedia article on URI’s.

The URIs for these downloads are constructed from the type of ids returned (e.g. keyword, movie, person) and the date when the file was created. For example, the URI for today’s table of keyword ids is

http://files.tmdb.org/p/exports/keyword_ids_12_04_2022.json.gz

The URIs.jl package for Julia provides capabilities to construct and deconstruct such URIs.

kwURI = URI("http://files.tmdb.org/p/exports/keyword_ids_12_04_2022.json.gz")
propertynames(kwURI)
(:uri, :scheme, :userinfo, :host, :port, :path, :query, :fragment)

The important parts of this URI for our purposes are the scheme, host and path.

Code
@show kwURI.scheme kwURI.host kwURI.path;
kwURI.scheme = "http"
kwURI.host = "files.tmdb.org"
kwURI.path = "/p/exports/keyword_ids_12_04_2022.json.gz"

The URI generic can also be used to create a URI from its components. Creating a URI in this way has the advantage that the individual components are checked to ensure that they are in the proper form.

The path part of the URI is intentionally similar to file paths in a file system, and many of the generic functions for working with file paths in Julia are generalized to URI’s. In particular, there is a method for joinpath that takes a URI and one or more strings to append to the path.

In this example we can create the fixed part of the URI and generate the variable information on the trailing part programmatically. Because the date is encoded as part of the URI, we attach the Dates package from Julia’s standard library to be able to work with different date formats.

Code
todaystr = Dates.format(today(), dateformat"mm_dd_yyyy")
"12_09_2022"

The URI of the latest version of the keyword ids download file can thus be generated as

uridir = URI(scheme="http", host="files.tmdb.org", path="/p/exports")
uri = joinpath(uridir, "keyword_ids_$todaystr.json.gz")
URI("http://files.tmdb.org/p/exports/keyword_ids_12_09_2022.json.gz")

Downloading the file

The HTTP package provides a download function to download a file. By default the file is downloaded to a temporary location that is cleared when the Julia session is over. We will instead download to a local file, which allows us to check if that file already exists and to avoid repeated downloads while developing this code.

We use the last part, called the basename, of the path in the uri as the filename.

kwfname = basename(uri.path)
if !isfile(kwfname)
    HTTP.download(string(uri), kwfname)
end
filesize(kwfname)
┌ Info: Downloading
│   source = http://files.tmdb.org/p/exports/keyword_ids_12_09_2022.json.gz
│   dest = keyword_ids_12_09_2022.json.gz
│   progress = 1.0
│   time_taken = 0.11 s
│   time_remaining = 0.0 s
│   average_speed = 4.264 MiB/s
│   downloaded = 489.034 KiB
│   remaining = 0 bytes
│   total = 489.034 KiB
└ @ HTTP /home/bates/.julia/packages/HTTP/RQd4C/src/download.jl:132
500771

Because there is no method defined for a URI yet, only for AbstractString.

methods(HTTP.download)
# 3 methods for generic function download:

but there probably should be.

The file could be decompressed, using whatever facilities the operating system provides, then read. An alternative is to open the file as an IOStream and stream it through a GZipDecompressorStream.

kwstrm = GzipDecompressorStream(open(kwfname, "r"))
TranscodingStreams.TranscodingStream{GzipDecompressor, IOStream}(<mode=idle>)

Finally, as is stated in the API documentation, the file, after decompression, isn’t a single JSON expression - as would be expected for a file with a .json file extension. Instead, each line is a JSON expression and all the lines evaluate to the same type of object. Fortunately, this approach is sufficiently common that JSON3.read has an optional argument jsonlines to flag this situation.

To see exactly the form, we examine the first line.

println(String(readline(seekstart(kwstrm))))
{"id":378,"name":"prison"}

As kwstrm has just been opened, it is redundant to call seekstart on it to reset the reading position to the beginning of the stream. However I often forget to do that on an existing stream and wonder why the contents don’t look as I expect them to, so I use this idiom.

We can see that the structure is very simple - an id number as an integer and a name as a string.

General table types in Julia

The Tables.jl package defines general concepts of row-oriented and column-oriented tables in Julia.

In a column-oriented table the columns are stored as vectors and the table consists of a named, ordered collection of vectors, which must all have the same length. For example, the DataFrames.jl package provides an implementation of mutable (i.e. capable of being modified), column-oriented, versatile tables.

The mutability refers to the ability to add, remove, replace, or rename columns. But mutability comes with a cost in complexity and size of the package. If we are willing to work with an immutable table we can use a NamedTuple of vectors (column-oriented) or a Vector{NamedTuple} (row-oriented). A NamedTuple is a strongly typed, ordered collection of named fields.

For example

nt = JSON3.read(readline(seekstart(kwstrm)), NamedTuple)
(id = 378, name = "prison")
typeof(nt)
NamedTuple{(:id, :name), Tuple{Int64, String}}

The concrete type of a NamedTuple includes the names (as Symbol’s), the order, and the types of the fields in the object.

A vector of a specific NamedTuple type is the archetypal row-oriented table in Julia.

The TypedTables.jl package provides a Table type that allows manipulation of row-oriented tables.

We can convert the jsonlines file to a Table as

kwtbl = Table(JSON3.read(seekstart(kwstrm), Vector{NamedTuple}; jsonlines=true))
Table with 2 columns and 49079 rows:
      id     name
    ┌─────────────────────────
 1  │ 378    prison
 2  │ 240    underdog
 3  │ 1787   helsinki, finland
 4  │ 730    factory worker
 5  │ 1361   salesclerk
 6  │ 4529   diplomat
 7  │ 1008   guerrilla warfare
 8  │ 612    hotel
 9  │ 613    new year's eve
 10 │ 616    witch
 11 │ 622    bet
 12 │ 2700   sperm
 13 │ 922    hotel room
 14 │ 2231   drug dealer
 15 │ 520    chicago, illinois
 16 │ 3737   dying and death
 17 │ 544    sailboat
 18 │ 11195  empire
 19 │ 4270   galaxy
 20 │ 7376   princess
 21 │ 10084  rescue
 22 │ 11196  rebellion
 23 │ 4932   farm
 ⋮  │   ⋮            ⋮
eltype(kwtbl)
NamedTuple{(:id, :name), Tuple{Int64, String}}

If we want to be more specific, say we want 32-bit integers instead of 64-bit integers for the id number, we provide a concrete type of NamedTuple instead of the abstract type.

kwtbl32 = JSON3.read(
    seekstart(kwstrm),
    Vector{NamedTuple{(:id, :name),Tuple{Int32, String}}};
    jsonlines=true,
)
49079-element Vector{NamedTuple{(:id, :name), Tuple{Int32, String}}}:
 (id = 378, name = "prison")
 (id = 240, name = "underdog")
 (id = 1787, name = "helsinki, finland")
 (id = 730, name = "factory worker")
 (id = 1361, name = "salesclerk")
 (id = 4529, name = "diplomat")
 (id = 1008, name = "guerrilla warfare")
 (id = 612, name = "hotel")
 (id = 613, name = "new year's eve")
 (id = 616, name = "witch")
 (id = 622, name = "bet")
 (id = 2700, name = "sperm")
 (id = 922, name = "hotel room")
 ⋮
 (id = 308547, name = "chamber orchestra")
 (id = 308548, name = "narcissistic woman")
 (id = 308549, name = "fart machine")
 (id = 308550, name = "eurovision national competition")
 (id = 308551, name = "hidden potential")
 (id = 308552, name = "sumo wrestling")
 (id = 308553, name = "ona")
 (id = 308554, name = "rhythm game")
 (id = 308557, name = "angel dust")
 (id = 308558, name = "dusted")
 (id = 308559, name = "vr film")
 (id = 308561, name = "ex-god")

Save a table in Arrow format

Arrow is a column-oriented, tabular format that has been implemented in several languages, including the Arrow.jl package for Julia.

Arrow is also used as the internal storage format for several projects such as the DuckDB database and the Polars DataFrame library for Rust and Python.

Writing the table to an Arrow format file

To make it easier to search for an id in the table, we sort kwtbl32 by id while writing it to an Arrow file.

kwarrow = Arrow.write(
    "keyword.arrow",
    sort(kwtbl32; by=getproperty(:id));
    metadata=["URI" => string(uri)],
)
"keyword.arrow"

Such a file can now be read by packages in many different languages.

Of course, it can be read with the Julia Arrow.jl package

newtbl = Arrow.Table(kwarrow)
Arrow.Table with 49079 rows, 2 columns, and schema:
 :id    Int32
 :name  String

with metadata given by a Base.ImmutableDict{String, String} with 1 entry:
  "URI" => "http://files.tmdb.org/p/exports/keyword_ids_12_09_2022.json.gz"
first(Table(newtbl), 10)
Table with 2 columns and 10 rows:
      id   name
    ┌──────────────────────
 1  │ 30   individual
 2  │ 65   holiday
 3  │ 74   germany
 4  │ 75   gunslinger
 5  │ 83   saving the world
 6  │ 90   paris, france
 7  │ 100  slum
 8  │ 107  barcelona, spain
 9  │ 108  transvestism
 10 │ 110  venice, italy

In Python we could use the pyarrow and polars packages as

$ python
Python 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:23:14) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.feather as feather
>>> import polars as pl
>>> feather.read_table("keyword.arrow")
pyarrow.Table
id: int32 not null
name: string not null
----
id: [[30,65,74,75,83,...,308144,308145,308146,308147,308150]]
name: [["individual","holiday","germany","gunslinger","saving the world",...,"boxing match","assassination plot","lifelong friends","navy commander","planetary alignment"]]
>>> feather.read_feather("keyword.arrow")  # as a pandas DataFrame
           id                 name
0          30           individual
1          65              holiday
2          74              germany
3          75           gunslinger
4          83     saving the world
...       ...                  ...
48838  308144         boxing match
48839  308145   assassination plot
48840  308146     lifelong friends
48841  308147       navy commander
48842  308150  planetary alignment

[48843 rows x 2 columns]
>>> pl.from_arrow(feather.read_table('keyword.arrow'))
shape: (48843, 2)
┌────────┬─────────────────────┐
 id     ┆ name                │
 ------
 i32    ┆ str                 │
╞════════╪═════════════════════╡
 30     ┆ individual          │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
 65     ┆ holiday             │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
 74     ┆ germany             │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
 75     ┆ gunslinger          │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
 ...    ┆ ...                 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
 308145 ┆ assassination plot  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
 308146 ┆ lifelong friends    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
 308147 ┆ navy commander      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
 308150 ┆ planetary alignment │
└────────┴─────────────────────┘

Automating the process

When automating the process we avoid the downloading to a file and use the HTTP.jl package to send an HTTP.get request whose response body is piped to an IOStream that is uncompressed on the fly.

"""
    dlidtbl(; date::Date=today(), uridir::URI, templates::Dict)

Download the tables of ids from `files.tmdb.org` and write them to Arrow files

The name and schema of each table are given in `templates`,
whose default value downloads all the ids files.
"""
function dlidtbl(;    # no positional arguments, only named arguments with defaults
    date::Date=today(),
    uridir::URI = URI(scheme="https", host="files.tmdb.org", path="/p/exports"),
    templates::Dict{<:AbstractString,<:NamedTuple} = Dict(
        "movie" => (adult=false, id=Int32(0), original_title="", popularity=1.0f0, video=false),
        "tv_series" => (id=Int32(0), original_name="", popularity=1.0f0),
        "person" => (adult=false, id=Int32(0), name="", popularity=1.0f0),
        "collection" => (id=Int32(0), name=""),
        "tv_network" => (id=Int32(0), name=""),
        "keyword" => (id=Int32(0), name=""),
        "production_company" => (id=Int32(0), name=""),
    )
)
    datestr = Dates.format(date, dateformat"mm_dd_yyyy") # format used in file names
    for nm in keys(templates)
        uri = joinpath(uridir, string(nm, "_ids_", datestr, ".json.gz"))
        r = HTTP.get(uri; response_stream=IOBuffer())
        if r.status  200  # response status of 200 indicates success
            throw(ArgumentError("get request to \"$uri\" returned status=$(r.status)"))
        end
        tbl = JSON3.read(
            GzipDecompressorStream(seekstart(r.body)),
            Vector{typeof(templates[nm])};
            jsonlines=true)
        Arrow.write(
            string(nm, ".arrow"),
            sort(tbl; by=getproperty(:id));
            compress=:lz4,
            metadata=["URI" => string(uri)],
        )
    end
    return nothing
end
dlidtbl
dlidtbl()
filter(endswith(".arrow"), readdir())
7-element Vector{String}:
 "collection.arrow"
 "keyword.arrow"
 "movie.arrow"
 "person.arrow"
 "production_company.arrow"
 "tv_network.arrow"
 "tv_series.arrow"

Conclusion

That’s it for this example. I hope this has stimulated interest in some of the Julia tools for sending HTTP requests and parsing the responses.

Next I plan to illustrate more advanced usage of the tmdb.org API.