Addressing my language gaps
I work with data every day. Whether it’s for my job at EPA or my various hobby projects, I have a text editor and command line open on my computer almost constantly. Importantly, I work with large amounts of data. My primary research project at EPA was based on a dataset with over 12 million rows and a model with some pretty expensive function evaluations, and I regularly work with second-by-second emissions data from heavy-duty trucks, often adding up to millions of records. The most common tool used in the office for this work is R, but it doesn’t scale very well in terms of memory or computation time.
I started trying to fix this problem by parallelizing, but R only handles embarrassingly parallel computation well. Even then, R’s parallel package can be pretty clunky and often leads to memory bloat. Looking for a language that scales well, I started doing a lot of my data analysis in Go, based on reputation and its highly explicit syntax and error handling.
Go is nice because it’s fast and parallelizes well in both the embarrassing and non-embarrassing cases, but it doesn’t have great tooling for either heavy-duty math or visualization - two areas where R continues to shine despite its imperfections. So almost as soon as I started using Go for data analysis, I started trying to use Go and R in the same pipeline so I could take advantage of the best parts of both languages in a single project. After some trial, error, and iteration, the Rgo repository was born, and its most comprehensive and successful piece so far is the sexp package.
The goal of the sexp package is to connect Go and R directly through C. This allows users to take advantage of the great features of R (easy development, dynamic typing, and great visualization tools) with the great features of Go (performance, type safety, and nontrivial parallelization).
There’s an easy way and a hard way to call Go from R. The easy way is via R’s system command, calling pre-compiled Go code (or, perhaps more commonly, using go run) where text files pass data back and forth. The problem with this is that it relies on slow I/O operations and doesn’t really allow good iteration during development.
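To make the easy way concrete, here is a minimal sketch of the Go half of such a pipeline. It assumes a hypothetical program that R would launch with something like system("go run colmean.go in.txt out.txt"); the file names, the one-number-per-line format, and the function names are all illustrative assumptions, not part of Rgo.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseColumn turns the contents of a one-number-per-line text file
// into a slice of float64, skipping blank lines.
func parseColumn(text string) ([]float64, error) {
	var out []float64
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		v, err := strconv.ParseFloat(line, 64)
		if err != nil {
			return nil, fmt.Errorf("parsing %q: %w", line, err)
		}
		out = append(out, v)
	}
	return out, nil
}

// mean returns the arithmetic mean, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	var total float64
	for _, x := range xs {
		total += x
	}
	return total / float64(len(xs))
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: colmean input.txt output.txt")
		return
	}
	raw, err := os.ReadFile(os.Args[1]) // file written by R beforehand
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	xs, err := parseColumn(string(raw))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// R reads this file back in once the command returns.
	result := strconv.FormatFloat(mean(xs), 'g', -1, 64) + "\n"
	if err := os.WriteFile(os.Args[2], []byte(result), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Every value crosses the boundary twice as text, once on the way in and once on the way out, which is where the I/O cost comes from.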
Because of that limitation, the sexp package connects Go and R the hard way - through C. Go and R can both interoperate with C (Go through cgo, R through its C interface), which gives them a common language. It might seem intuitive to simply compile Go code to C and use R’s C interface to call it, but that’s just a more difficult version of the easy way we just ruled out. Instead, it’s better to use C only to pass data back and forth so that both languages have access to the data without too much I/O.
While Go talks to C (via the appropriately named cgo) and the data types translate mostly as expected, R’s objects look nothing in C like they do in R. Under the hood, all objects in R are stored as S-expressions. In order to understand how to pass data from R to Go, it’s necessary to have a functional knowledge of S-expressions and how to extract data from them.
S-expressions are based on pointers. This helps R’s performance because it can take advantage of some clever copying to reduce memory allocations and use an address-based garbage collector. The S-expression type (called SEXP) in R is actually just a pointer to a SEXPREC object in C, which is (kind of) where the data itself is located.
The SEXPREC contains two things: a header and a struct. The header is essentially just metadata, such as type and length of the underlying object. The struct contains the data itself, or rather, a pointer to the data itself. R provides internal functions to extract the data out of a SEXP object, the most important of which include TYPEOF to return the type of data in the SEXP, LENGTH to return the length of the underlying vector, and REAL and INTEGER (and others) to return the data itself, also functioning as a sort of type assertion.
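The header-plus-data idea can be illustrated with a toy model in plain Go. This is not R’s actual memory layout and does not call R at all: toyVector and its accessor functions are hypothetical stand-ins that only mimic the behaviour described above, where the accessors for the data double as a type check. The two numeric codes are the values R’s headers assign to INTSXP and REALSXP.

```go
package main

import (
	"errors"
	"fmt"
)

// Type codes; the values match INTSXP and REALSXP in R's headers.
const (
	intKind  = 13
	realKind = 14
)

// toyVector is a simplified stand-in for a SEXPREC: metadata plus data.
type toyVector struct {
	kind  int       // what TYPEOF would report
	ints  []int32   // populated when kind == intKind
	reals []float64 // populated when kind == realKind
}

var errWrongKind = errors.New("vector does not hold the requested type")

// typeOf plays the role of TYPEOF.
func typeOf(v *toyVector) int { return v.kind }

// length plays the role of LENGTH.
func length(v *toyVector) int {
	if v.kind == intKind {
		return len(v.ints)
	}
	return len(v.reals)
}

// reals plays the role of REAL: it only succeeds on a real vector.
func reals(v *toyVector) ([]float64, error) {
	if v.kind != realKind {
		return nil, errWrongKind
	}
	return v.reals, nil
}

// integers plays the role of INTEGER: it only succeeds on an integer vector.
func integers(v *toyVector) ([]int32, error) {
	if v.kind != intKind {
		return nil, errWrongKind
	}
	return v.ints, nil
}

func main() {
	v := &toyVector{kind: realKind, reals: []float64{0.5, 1.5}}
	fmt.Println(typeOf(v), length(v))
	if _, err := integers(v); err != nil {
		fmt.Println("integers:", err)
	}
}
```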
Under the hood, the vectors in S-expressions are similar to Go’s slices, comprised of a pointer to an underlying array. The key difference is that the pointer in the SEXPREC only points to the beginning of the array because C allows pointer arithmetic. All one does to get each element of the vector is sequentially index the SEXPREC pointer.
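The contrast with a Go slice can be sketched in pure Go. The example below uses the standard unsafe package (unsafe.Add, available since Go 1.17) to imitate what C does when it indexes a pointer to the first element of an array. The cVector type is a hypothetical illustration and is not how the sexp package itself walks a vector.

```go
package main

import (
	"fmt"
	"unsafe"
)

// cVector mimics what C code sees: a pointer to the first element
// plus a length that has to be tracked separately.
type cVector struct {
	first  *float64
	length int
}

// at steps from the first element to element i, the way C's p[i] does.
func (v cVector) at(i int) float64 {
	if i < 0 || i >= v.length {
		panic("index out of range")
	}
	offset := uintptr(i) * unsafe.Sizeof(*v.first)
	return *(*float64)(unsafe.Add(unsafe.Pointer(v.first), offset))
}

// toSlice copies every element into an ordinary Go slice,
// which carries its own length and bounds checks.
func (v cVector) toSlice() []float64 {
	out := make([]float64, v.length)
	for i := range out {
		out[i] = v.at(i)
	}
	return out
}

func main() {
	backing := [4]float64{1.5, 2.5, 3.5, 4.5}
	v := cVector{first: &backing[0], length: len(backing)}
	fmt.Println(v.toSlice()) // [1.5 2.5 3.5 4.5]
}
```

Unlike a slice, the bare pointer knows nothing about where the array ends, so the length has to come from somewhere else (in R’s case, the header).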
Broadly speaking there are two key differences between R/C and Go that need to be reconciled. First, the type system in R is not completely compatible with Go, even after following the SEXP pointers. Second, Go doesn’t allow pointer arithmetic, which is essential to extracting data from a SEXP object in C.
There are two key numeric types which need to be passed between R and Go: integers and floats (called doubles in R/C). R’s doubles mirror Go’s float64 type perfectly, and converting from one to the other in cgo is trivial. Likewise, R’s integer data type maps cleanly onto Go’s int: R integers are 32 bits wide, so they always fit, though values headed back to R have to stay within the 32-bit range. The sexp package doesn’t use more specific integer types in Go like int16 or uint64, nor does it use the float32 type.
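One detail worth keeping in mind when moving integers back to R: R stores integers as 32-bit C ints and reserves the smallest 32-bit value for NA, while Go’s int is usually 64 bits wide on modern machines. The following is a minimal sketch of a range check for that direction; toRInteger and fromRInteger are hypothetical helper names, not functions from the sexp package.

```go
package main

import (
	"fmt"
	"math"
)

// toRInteger narrows a Go int to the 32-bit range R integers use.
// math.MinInt32 is rejected because R reserves that value for NA.
func toRInteger(v int) (int32, error) {
	if v <= math.MinInt32 || v > math.MaxInt32 {
		return 0, fmt.Errorf("%d does not fit in an R integer", v)
	}
	return int32(v), nil
}

// fromRInteger widens an R-sized integer to a Go int; this never loses data.
func fromRInteger(v int32) int { return int(v) }

func main() {
	small, err := toRInteger(42)
	fmt.Println(small, err)

	big := math.MaxInt32
	big++ // one past the largest value R can hold
	_, err = toRInteger(big)
	fmt.Println(err)
}
```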
The tough type to reconcile is strings. In Go, strings are just read-only slices of bytes. In R, however, strings are vectors of the char type in C, which is represented differently under the hood. In order to reconcile the differences, sexp converts R’s char type to bytes, but then it has to convert the bytes to integers because R’s encoding is also different from Go’s. Luckily, converting from integers to bytes and then to a string is trivial in Go.
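The Go half of that conversion might look like the sketch below. It assumes the integers arrive as one code per byte; that representation and the function names codesToString and stringToCodes are illustrative assumptions rather than the sexp package’s actual API.

```go
package main

import "fmt"

// codesToString rebuilds a Go string from integer character codes,
// treating each code as a single byte.
func codesToString(codes []int) (string, error) {
	b := make([]byte, len(codes))
	for i, c := range codes {
		if c < 0 || c > 255 {
			return "", fmt.Errorf("code %d at index %d is not a byte", c, i)
		}
		b[i] = byte(c)
	}
	return string(b), nil
}

// stringToCodes goes the other way: one integer per byte of the string.
func stringToCodes(s string) []int {
	codes := make([]int, len(s))
	for i := 0; i < len(s); i++ {
		codes[i] = int(s[i])
	}
	return codes
}

func main() {
	s, err := codesToString([]int{82, 103, 111})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(s)                // Rgo
	fmt.Println(stringToCodes(s)) // [82 103 111]
}
```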
In order to make communication easier, everything passed between R and Go via sexp is a vector, even if it’s just length 1. Matrices aren’t currently supported (at least not directly - one could pass several vectors and reformat them into a matrix), but they may be a good candidate for future improvement. More complex types, such as R’s lists or Go’s structs/interfaces, are also not supported and, in my opinion, more work to reconcile than they’re worth.
The second reconciliation, pointer arithmetic, is annoying but ultimately not hard to solve. Using C code written in the cgo preamble to extract a single element from a SEXP object given a type and index is trivial, and insertion into an existing SEXP is also pretty easy. For Go and R users who don’t necessarily know C that well (like me), there’s an advantage in only using simple extract/insert functions from C so that Go is used to check types, get lengths, and create new SEXP objects. All Go needs to do is call the simple C functions in a for loop to insert/extract vector elements and then pass them to R.
The last issue to sort out is exposing the sexp package types (particularly the SEXPREC itself) to other Go packages. cgo doesn’t allow C types to be passed between Go packages, but the functions to insert and extract values obviously require SEXP inputs. Unfortunately, the only way to pass the SEXP aliases around is via unsafe pointers. This is, as the name implies, type unsafe and can be error-prone, but I can’t think of any other way. In order to make the unsafe pointer a little safer, the sexp package provides the GoSEXP type, which is a simple struct container for the unsafe pointer with a clear name. There’s still type assertion to do and it’s far from foolproof, but it hopefully makes code that includes the SEXP object a little clearer.
Overall, the sexp package can be a helpful tool when your data analysis is complex or big enough to warrant the use of both R and Go. It’s obviously faster and more flexible than using R to call pre-compiled code and eases the I/O burden. It’s pretty minimalist in its design for now and easy to break (though I’ve found so far that it’s easy to not break as well), so there’s a list of improvements I’d like to make on the GitHub page. I’ve also got a simple demo up on my shiny server.
My next post will demonstrate a clear use case for the sexp package, using the publicly available repository of programs provided by the New York Philharmonic!