Advanced R programming by Hadley Wickham


R's C interface

Introduction

Reading the source code of R is an extremely powerful technique for improving your R programming. However, at some point you will hit a brick wall: many base R functions are implemented in C. This chapter will give you a basic introduction R's C api. You'll need some basic C knowledge (familiarity with "Kernigan and Ritchie" is recommended), and a lot of patience. But it is possible to read R's C source code, and you will learn a lot doing it.

Even if you've never used C before, you can still puzzle out much of what's going on because the basic syntax of C is similar to R. Some important differences:

  • Variables can only store specific types of object, and must be declared before use.

  • Objects are modified in place, unless you specifically copy the object.

  • Indices start at 0, not 1.

  • You must use a semi-colon at end of each expression.

  • You must have an explicit return statement.

  • Assignment is done with =, not <-.

The contents of this draw heavily from Section 5 ("System and foreign language interfaces") of Writing R extensions, but focus on best practices and modern tools. This means it does not cover the old .C interface, the old api defined in Rdefines.h or rarely used language features. You are unlikely to see these in modern R code.

To see the complete C api provided by R, look at the header file Rinternals.h. It's easiest to find and display this file from within R:

rinternals <- file.path(R.home("include"), "Rinternals.h")
file.show(rinternals)

All functions are defined with either the prefix Rf_ or R_ but are exported without it.

I do not recommend using C for writing new high-performance code. Instead use Rcpp and write C++. The Rcpp API protects you from many of the historical idiosyncracies of the R API, takes care of memory management for you, and provides many useful helper methods.

You'll start by learning how to call R functions from C in calling C. Next, in C data structures{#c-data-structures}, you'll learn how to translate data structure names from R to C. Unforutnately the R API harkens back to early years of R so this is not very easy/consistent. Once you know what vectors are called in C you'll want to create, coerce and modify them, as described in creating and modifying vectors. The distinction between pairlists and list is more important in C than R, so pairlists gives you the run down. It's easy to crash R if you make invalid assumptions about the type of data supplied to your function, so input validation gives you the basics on input validation. The chapter concludes by describing how to find the C source code corresponding to an R function, finding the C source for a function.

Prerequisites

To understand existing C code, it's useful to generate simple examples of your own that you can experiment with. To that end, all examples in this chapter use the inline package, which makes it extremely easy to compile and link C code to your current R session. Get it by running install.packages("Rcpp").

You'll also need a C compiler. Windows users can use Duncan Murdoch's Rtools. Mac users will need the Xcode command line tools. Most Linux distributions will come with the necessary compilers.

Calling C functions from R

Generally, to call a C function from R requires two pieces: a C function, and an R wrapper function that uses .Call(). The simple function below adds two numbers together and illustrates some of the important features of coding in C (creating new R vectors, coercing input arguments to the appropriate type and dealing with garbage collection).

// In C ----------------------------------------
#include <R.h>
#include <Rinternals.h>

SEXP add(SEXP a, SEXP b) {
  SEXP result;

  PROTECT(result = allocVector(REALSXP, 1));
  REAL(result)[0] = asReal(a) + asReal(b);
  UNPROTECT(1);

  return result;
}
# In R ----------------------------------------
add <- function(a, b) {
  .Call("add", a, b)
}

(An alternative to using .Call is to use .External. It is used almost identically, except that the C function will recieve a single argument containing a LISTSXP, a pairlist from which the arguments can be extracted. This makes it possible to write functions that take a variable number of arguments. However, it's not commonly used in base R, inline does not currently support .External functions so I don't discuss it further in this chapter).

In this chapter we'll produce the two pieces in one step by using the inline package. This allows us to write:

add <- cfunction(c(a = "integer", b = "integer"), "
  SEXP result;

  PROTECT(result = allocVector(REALSXP, 1));
  REAL(result)[0] = asReal(a) + asReal(b);
  UNPROTECT(1);

  return result;
")
add(1, 5)
#> [1] 6

Before we begin writing and reading C code, we need to know a little about the basic data structures.

C data structures

At the C-level, all R objects are stored in a common datatype, the SEXP. (Technically, this is a pointer to a structure with typedef SEXPREC). A SEXP is a variant type, with subtypes for all R's data structures. The most important types are:

  • REALSXP: numeric vectors
  • INTSXP: integer vectors
  • LGLSXP: logical vectors
  • STRSXP: character vectors
  • VECSXP: lists
  • CLOSXP: functions (closures)
  • ENVSXP: environments

Beware: At the C level, R's lists are VECSXPs not LISTSXPs. This is because early implementations of R used Lisp-like linked lists (now known as "pairlists") before moving to the S-like generic vectors that we now know as lists.

Character vectors are a little more complicated than the other atomic vectors. A STRSXPs contains a vector of CHARSXPs, where each CHARSXP points to C-style string stored in a global pool. This design allows individual CHARSXP's to be shared between multiple strings, reducing memory usage. See object size for more details.

There are also SEXPs for less common object types:

  • CPLXSXP: complex vectors
  • LISTSXP: "pair" lists. At the R level, you only need to care about the distinction lists and pairlists for function arguments, but internally they are used in many more places.
  • DOTSXP: '...'
  • SYMSXP: names/symbols
  • NILSXP: NULL

And SEXPs for internal objects, objects that are usually only created and used by C functions, not R functions:

  • LANGSXP: language constructs
  • CHARSXP: "scalar" strings
  • PROMSXP: promises, lazily evaluated function arguments
  • EXPRSXP: expressions

There's no built-in R function to easily access these names, but we can write one:

sexp_type <- cfunction(c(x = "ANY"), '
  switch (TYPEOF(x)) {
    case NILSXP:      return mkString("NILSXP");
    case SYMSXP:      return mkString("SYMSXP");
    case LISTSXP:     return mkString("LISTSXP");
    case CLOSXP:      return mkString("CLOSXP");
    case ENVSXP:      return mkString("ENVSXP");
    case PROMSXP:     return mkString("PROMSXP");
    case LANGSXP:     return mkString("LANGSXP");
    case SPECIALSXP:  return mkString("SPECIALSXP");
    case BUILTINSXP:  return mkString("BUILTINSXP");
    case CHARSXP:     return mkString("CHARSXP");
    case LGLSXP:      return mkString("LGLSXP");
    case INTSXP:      return mkString("INTSXP");
    case REALSXP:     return mkString("REALSXP");
    case CPLXSXP:     return mkString("CPLXSXP");
    case STRSXP:      return mkString("STRSXP");
    case DOTSXP:      return mkString("DOTSXP");
    case ANYSXP:      return mkString("ANYSXP");
    case VECSXP:      return mkString("VECSXP");
    case EXPRSXP:     return mkString("EXPRSXP");
    case BCODESXP:    return mkString("BCODESXP");
    case EXTPTRSXP:   return mkString("EXTPTRSXP");
    case WEAKREFSXP:  return mkString("WEAKREFSXP");
    case S4SXP:       return mkString("S4SXP");
    case RAWSXP:      return mkString("RAWSXP");
    default:          return mkString("<unknown>");
}')
sexp_type(10)
#> [1] "REALSXP"
sexp_type(10L)
#> [1] "INTSXP"
sexp_type("a")
#> [1] "STRSXP"
sexp_type(T)
#> [1] "LGLSXP"
sexp_type(list(a = 1))
#> [1] "VECSXP"
sexp_type(pairlist(a = 1))
#> [1] "LISTSXP"

(code adapated from R's inspect.c)

Creating and modifying vectors

At the heart of every C function will be a set of conversions between R data structures and C data structures. Inputs and output will always be R data structures (SEXPs) and you will need to convert them to C data structures in order to do any work. This section focusses on vectors because they're the type of object you're most often working with.

An additional complication is the garbage collector: if you don't claim every R object you create, the garbage collector will think they are unused and delete them.

Creating vectors and garbage collection

The simplest way to create an new R-level object is allocVector, which takes two arguments, the type of SEXP (or SEXPTYPE) to create, and the length of the vector. The following code creates a three element list containing a logical vector, a numeric vector and an integer vector:

dummy <- cfunction(body = '
  SEXP vec, real, lgl, ints;

  PROTECT(real = allocVector(REALSXP, 2));
  PROTECT(lgl = allocVector(LGLSXP, 10));
  PROTECT(ints = allocVector(INTSXP, 10));

  PROTECT(vec = allocVector(VECSXP, 3));
  SET_VECTOR_ELT(vec, 0, real);
  SET_VECTOR_ELT(vec, 1, lgl);
  SET_VECTOR_ELT(vec, 2, ints);

  UNPROTECT(4);
  return vec;
')
dummy()
#> [[1]]
#> [1] 1.584e-316 1.693e-316
#> 
#> [[2]]
#>  [1] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
#> 
#> [[3]]
#>  [1] 0 1 0 1 0 0 0 1 0 0

You might wonder what all the PROTECT() calls do. They tell R that we're currently using each object, and not to delete it if the garbage collector is activated. (We don't need to protect objects that R already knows we're using, like function arguments.)

You also need to make sure that every protected object is unprotected. UNPROTECT() takes a single integer argument, n, and unprotects the last n objects that were protected. The number of protects and unprotects must match, if not, R will warn about a "stack imbalance in .Call". Other specialised forms protection needed in some circumstances: UNPROTECT_PTR() unprotects the object pointed to by the SEXP s, PROTECT_WITH_INDEX() saves an index of the protection location that can be used to replace the protected value using REPROTECT(). Consult the R externals section on garbage collection for more details.

If you run dummy() a few times, you'll notice the output is basically random. This is because allocVector() assigns memory to each output, but it doesn't clean it out first. For real functions, you'll may want to loop through each element in the vector and zero it out. The most efficient way to do that is to use memset:

zeroes <- cfunction(c(n_ = "integer"), '
  int n = asInteger(n_);
  SEXP out;

  PROTECT(out = allocVector(INTSXP, n));
  memset(INTEGER(out), 0, n * sizeof(int));
  UNPROTECT(1);

  return out;
')
zeroes(10);
#>  [1] 0 0 0 0 0 0 0 0 0 0

Missing and non-finite values

Each atomic vector has a special constant for getting or setting missing values:

  • INTSXP: NA_INTEGER
  • LGLSXP: NA_LOGICAL
  • STRSXP: NA_STRING

Missing values are somewhat more complicated for REALSXP because NAs are a special type of NaN, and there are also special values for positive and negative infinity. Use ISNA(), ISNAN(), and !R_FINITE() macros to check for missing, NaN or non-finite values. Use the constants NA_REAL, R_NaN, R_PosInf and R_NegInf to set those values.

We can use this knowledge to make a simple version of is.NA():

is_na <- cfunction(c(x = "ANY"), '
  SEXP out;
  int n = length(x);

  PROTECT(out = allocVector(LGLSXP, n));

  for (int i = 0; i < n; i++) {
    switch(TYPEOF(x)) {
      case LGLSXP:
        LOGICAL(out)[i] = (LOGICAL(x)[i] == NA_LOGICAL);
        break;
      case INTSXP:
        LOGICAL(out)[i] = (INTEGER(x)[i] == NA_INTEGER);
        break;
      case REALSXP:
        LOGICAL(out)[i] = ISNA(REAL(x)[i]);
        break;
      case STRSXP:
        LOGICAL(out)[i] = (STRING_ELT(x, i) == NA_STRING);
        break;
      default:
        LOGICAL(out)[i] = NA_LOGICAL;
    }
  }
  UNPROTECT(1);

  return out;
')
is_na(c(NA, 1L))
#> [1]  TRUE FALSE
is_na(c(NA, 1))
#> [1]  TRUE FALSE
is_na(c(NA, "a"))
#> [1]  TRUE FALSE
is_na(c(NA, TRUE))
#> [1]  TRUE FALSE

It's worth noting that R's base::is.na() returns TRUE for both NA and NaNs in a numeric vector, as opposed to the C ISNA() macro, which returns TRUE only for NA_REALs.

Accessing vector data

There is a helper function for each atomic vector (apart from character, see following) that allows you to access the C array which stores the data in a vector. The following example shows the use of REAL() to inspect and modify a numeric vector:

add_one <- cfunction(c(x = "numeric"), "
  SEXP out;
  int n = length(x);

  PROTECT(out = allocVector(REALSXP, n));
  for (int i = 0; i < n; i++) {
    REAL(out)[i] = REAL(x)[i] + 1;
  }
  UNPROTECT(1);

  return out;
")
add_one(as.numeric(1:10))
#>  [1]  2  3  4  5  6  7  8  9 10 11

There are similar helpers for logical, LOGICAL(x), integer, INTEGER(x), complex, COMPLEX(x) and raw vectors RAW(x). If you're working with longer vectors, there's a performance advantage to using the helper function once and saving the result in a pointer:

add_two <- cfunction(c(x = "numeric"), "
  SEXP out;
  int n = length(x);
  double *px, *pout;

  PROTECT(out = allocVector(REALSXP, n));

  px = REAL(x);
  pout = REAL(out);
  for (int i = 0; i < n; i++) {
    pout[i] = px[i] + 2;
  }
  UNPROTECT(1);

  return out;
")
add_two(as.numeric(1:10))
#>  [1]  3  4  5  6  7  8  9 10 11 12

library(microbenchmark)
x <- as.numeric(1:1e6)
microbenchmark(
  add_one(x),
  add_two(x)
)
#> Unit: milliseconds
#>        expr    min    lq median    uq   max neval
#>  add_one(x) 10.879 12.39  15.86 20.40 106.6   100
#>  add_two(x)  9.151 11.04  13.27 17.76 109.1   100

On my computer, add_two is about twice as fast as add_one for a million element vector. This is a common idiom in R source code.

Character vectors and lists

Strings and lists are more complicated because the individual elements are SEXPs. The elements of a STRSXP are all CHARSXPs, which are immutable objects that contains a C string and a stored in a global pool. Use STRING_ELT(x, i) to extract the CHARSXP, and CHAR(STRING_ELT(x, i)) to get the actual const char* string. Set values with SET_STRING_ELT(x, i, value). Use mkChar() to turn a C string into a CHARSXP.

The following function shows how to make a character vector containing known strings:

abc <- cfunction(NULL, '
  SEXP out;
  PROTECT(out = allocVector(STRSXP, 3));

  SET_STRING_ELT(out, 0, mkChar("a"));
  SET_STRING_ELT(out, 1, mkChar("b"));
  SET_STRING_ELT(out, 2, mkChar("c"));

  UNPROTECT(1);

  return out;
')
abc()
#> [1] "a" "b" "c"

Things are a little harder if you want to modify the strings in the vector because you need to know a lot about string manipulation in C (which is hard, and harder to do right). For any problem that involves any kind of string modification, you're better off using Rcpp.

The elements of a list can be any other SEXP, which generally makes them hard to work with in C (you'll need lots of switch statements to deal with the possibilities). The accessor functions for lists are VECTOR_ELT(x, i) and SET_VECTOR_ELT(x, i, value).

Modifying inputs

You must be very careful when modifying function inputs. The following function has some very unexpected behaviour:

add_three <- cfunction(c(x = "numeric"), '
  REAL(x)[0] = REAL(x)[0] + 3;
  return x;
')
x <- 1
y <- x
add_three(x)
#> [1] 4
x
#> [1] 4
y
#> [1] 4

Not only has it modified the value of x, but it has also modified y! This happens because of R's lazy copy-on-modify semantics. To avoid problems like this, always duplicate() inputs before modifying them:

add_four <- cfunction(c(x = "numeric"), '
  SEXP x_copy;
  PROTECT(x_copy = duplicate(x));
  REAL(x_copy)[0] = REAL(x_copy)[0] + 4;
  UNPROTECT(1);
  return x_copy;
')
x <- 1
y <- x
add_four(x)
#> [1] 5
x
#> [1] 1
y
#> [1] 1

If you're working with lists, use shallow_duplicate() to do a shallow duplication that doesn't also duplicate every individual element.

Coercing scalars

There a few helper functions to turn length one R vectors into a C scalars:

  • asLogical(x): INTSXP -> int
  • asInteger(x): INTSXP -> int
  • asReal(x): REALSXP -> double
  • CHAR(asChar(x)): STRSXP -> const char*

And similarly it's easy to turn a C scalar into a length-one R vector:

  • ScalarLogical(x): int -> LGLSXP
  • ScalarInteger(x): int -> INTSXP
  • ScalarReal(x): double -> REALSXP
  • mkString(x): const char* -> STRSXP

These all create R-level objects, so need to be PROTECT()ed.

Pairlists

In R code, there are very places where you need to care about the difference between a pairlist and a list (as described in Pairlists). Working with pairlists is much more important for C code because they are used for calls, unevaluated arguments, attributes and in .... There are two primary differences between lists and pairlists at the C level: how you access and name elements.

Unlike lists (VECSXPs), pairlists (LISTSXPs) have no way to index into an arbitrary location. Instead, R provides a set of helper functions that navigate along a linked list. The basic helpers are CAR() which extracts the first element of the list, and CDR() which extracts the rest. These can be composed to get CAAR(), CDAR(), CADDR(), CADDR() and so on. As well as the getters, R also provides SETCAR(), SETCDR() etc, to modify elements of a pairlist.

The following example shows how CAR() and CDR() pull out pieces of a quoted function call:

car <- cfunction(c(x = "ANY"), 'return CAR(x);')
cdr <- cfunction(c(x = "ANY"), 'return CDR(x);')
cadr <- cfunction(c(x = "ANY"), 'return CADR(x);')

x <- quote(f(a = 1, b = 2))
# The first element
car(x)
#> f
# Second and third elements
cdr(x)
#> $a
#> [1] 1
#> 
#> $b
#> [1] 2
# Second element
car(cdr(x))
#> [1] 1
cadr(x)
#> [1] 1

Pairlists are always terminated with R_NilValue, so to loop over all elements of a pairlist, use a for loop as follows:

count <- cfunction(c(x = "ANY"), '
  SEXP el, nxt;
  int i = 0;

  for(nxt = x; nxt != R_NilValue; el = CAR(nxt), nxt = CDR(nxt)) {
    i++;
  }
  return ScalarInteger(i);
')
count(quote(f(a, b, c)))
#> [1] 4
count(quote(f()))
#> [1] 1

You can make new pairlists with CONS() and new calls with LCONS(). Remember to set the last value to R_NilValue.

new_call <- cfunction(NULL, '
  SEXP out;

  out = LCONS(install("+"), LCONS(
    ScalarReal(10), LCONS(
      ScalarReal(5), R_NilValue)
    )
  );
  return out;
')
new_call();
#> 10 + 5

TAG() and SET_TAG() allow you to get and set the tag (aka name) associated with an element of a pairlist. The tag should be a symbol. To create a symbol (the equivalent of as.symbol() or as.name() in R), use install().

Attributes are also pairlists, but come with the helper functions setAttrib() and getAttrib() to make access a little easier:

set_attr <- cfunction(c(obj = "ANY", attr = "character", value = "ANY"), '
  const char* attr_s = CHAR(asChar(attr));

  duplicate(obj);
  setAttrib(obj, install(attr_s), value);
  return obj;
')
x <- 1:10
set_attr(x, "a", 1)
#>  [1]  1  2  3  4  5  6  7  8  9 10
#> attr(,"a")
#> [1] 1

(Note that setAttrib() and getAttrib() must do a linear search over the attributes pairlist.)

There are some (confusingly named) shortcuts for common setting operations: classgets(), namesgets(), dimgets() and dimnamesgets() are the internal versions of the default methods of class<-, names<-, dim<- and dimnames<-.

Input validation

If the user provides unexpected input to your function (e.g. a list instead of a numeric vector), it's very easy to crash R. For this reason, it's a good idea to write a wrapper function that checks arguments are of the correct type, or coerces them if necessary. It's usually easier to do this at the R level. For example, going back to our first example of C code, we might rename it to add_ and then write a wrapper function to check the inputs are ok:

add_ <- cfunction(signature(a = "integer", b = "integer"), "
  SEXP result;

  PROTECT(result = allocVector(REALSXP, 1));
  REAL(result)[0] = asReal(a) + asReal(b);
  UNPROTECT(1);

  return result;
")
add <- function(a, b) {
  stopifnot(is.numeric(a), is.numeric(b), length(a) == 1, length(b) == 1)
  add_(a, b)
}

Or if we wanted to be more accepting of diverse inputs:

add <- function(a, b) {
  a <- as.numeric(a)
  b <- as.numeric(b)

  if (length(a) > 1) warning("Only first element of a used")
  if (length(b) > 1) warning("Only first element of b used")
  
  add_(a, b)
}

To coerce objects at the C level, use PROTECT(new = coerceVector(old, SEXPTYPE)). This will return an error if the SEXP can not be converted to the desired type.

To check if an object is of a specified type, you can use TYPEOF, which returns a SEXPTYPE:

is_numeric <- cfunction(c("x" = "ANY"), "
  return ScalarLogical(TYPEOF(x) == REALSXP);
")
is_numeric(7)
#> [1] TRUE
is_numeric("a")
#> [1] FALSE

Or you can use one of the many helper functions. They all return 0 for FALSE and 1 for TRUE:

  • For atomic vectors: isInteger(), isReal(), isComplex(), isLogical(), isString().

  • For combinations of atomic vectors: isNumeric() (integer, logical, real), isNumber() (integer, logical, real, complex), isVectorAtomic() (logical, interger, numeric, complex, string, raw)

  • Matrices (isMatrix()) and arrays (isArray())

  • For other more esoteric object: isEnvironment(), isExpression(), isList() (a pair list), isNewList() (a list), isSymbol(), isNull(), isObject() (S4 objects), isVector() (atomic vectors, lists, expressions).

Note that some of these functions behave differently to the R-level functions with similar names. For example isVector() is true for atomic vectors, lists and expressions, where is.vector() is returns TRUE only if its input has no attributes apart from names.

Finding the C source code for a function

To read the source for an arbitrary function, you'll first need a copy of the R source code. You can get it from CRAN, or try one of the github mirrors:

Once you've got the source code on disk, there are three steps:

  1. Look at the source code for the R function and find the name of the corresponding C function. For example, the source code for findInterval() contains .Internal(findInterval(...)) so that C function name is findInterval(). (It's common for the R and C function names to be the same.)

  2. Open src/main/names.c and search for the name of the function. You'll find an entry that tells you the name of the function that's actually called. It always starts with do_ and is lowercase. For example, findInterval() is on line 902 and calls internal function do_findinterval().

  3. Next, search the R source code for that name. To make it easier to find where it's defined (rather than everywhere it's used), you can add (SEXP. For example, to find where do_findinterval() is defined, search for do_findinterval(SEXP.

Interval and primitive functions have a somewhat different interface to .Call functions. They all have four arguments:

  • SEXP call: the complete call to the function. CAR(call) would give the name of the function (as a symbol).

  • SEXP op: many an optional argument used to distinguish between multiple R functions that use the same C function. For example do_logic() implements &, | and !. This value is the third column in names.c.

  • SEXP args a pairlist containing the unevaluated arguments to the function.

  • SEXP rho the environment in which the call was executed.

This gives internal functions an incredible amount of flexibility as to how and when the arguments are evaluated. For example, internal S3 generics call DispatchOrEval() which either calls the appropriate S3 method or evaluates all the arguments in place. This flexibility does come at a cost which makes the code harder to understand, but usually evaluating the arguments is the first step and the rest of the function is more straightforward.