Greta package?


#1

I think I remember reading about a program for setting up statistical models that kind of worked directly in R. But I can’t remember what it’s called. Or maybe I just made this up? To set up a linear regression model of a variable y on x, you would write an R script along the lines of…

library(AwesomeStatisticalModelingLanguage)

data <- read_data('my_file.csv')
y <- data$y
x <- data$x

y <- rnorm(x*beta, sigma^2)

And it would output posteriors on beta and sigma. There’s something wrong about the way I’ve written it since y is multiply defined, but I think the general idea is right.


#2

I think you are trying to describe the greta R package.


#3

wow – that was quick and that’s definitely the right answer.

How do people feel about greta? Is it appropriate for complex models? Does it improve scalability to big datasets? Is it easier to reason and communicate about models written in greta/R rather than those written in other languages (STAN/BUGS/JAGS)?


#4

Well, you are asking on a forum for people who use Stan 99% of the time, so I would imagine the answers you get to your questions are pretty predictable.

The main difference between the TensorFlow based HMC stuff and Stan is that Stan uses a greedy integration step where the leapfrogger goes until it U-turns (or hits max_treedepth), which makes the number of function evaluations non-deterministic. TensorFlow uses a fixed integration time for the leapfrogger, which allows them to optimize the speed of each function evaluation. Stan people know that dynamic integration time is critical for sampling from distributions that have tails that are heavier than normal, which is often the case particularly with more complex models. So, there are a set of relatively easy problems that can be handled with static HMC (if you get the fixed number of steps right), but those problems can also be handled with Stan’s dynamic HMC algorithm and usually those models can be specified easily with rstanarm or brms.


#5

There’s also NIMBLE, which is another embedded in R graphical modeling language. NIMBLE hasn’t implemented HMC as far as I know.

I have to say it’s a pretty nice interface dispalyed in Greta’s Hello World example. Some of their comparisons to Stan seem unfair in that they’re adding an external piece of information in model(..params..) specifying what the parameters are and also external pieces of information in defining the data externally in order to define their types (presumably this is necessary, but I could be wrong here). They also include generated quantities in the Stan example they don’t calculate in Greta (e.g, in the Beetles example).

I just installed Greta. Wow, that was really really easy. It even installed all the Python tensorflow stuff. And it worked:

> x <- iris$Petal.Length
> y <- iris$Sepal.Length
> int <- normal(0, 5)
> coef <- normal(0, 3)
> sd <- lognormal(0, 3)
> mean <- int + coef * x
> distribution(y) <- normal(mean, sd)
> m <- model(int, coef, sd)
> draws <- mcmc(m, n_samples = 1000, chains = 3)
Error in mcmc(m, n_samples = 1000, chains = 3) : 
  unused argument (chains = 3)

This is a bug in their tutorial example. Removing the number of chains specification lets it run:

> draws <- mcmc(m, n_samples = 1000)
    warmup ====================================   100/100 | eta:  0s          
  sampling ==================================== 1000/1000 | eta:  0s     

It even has a cool countdown counter (I wonder if they found a way to make the overwriting output portable to terminal, RStudio and GUI). We need something like this.

I then ran it again for timing. Greta took nearly a minute to do one chain with 100 warmup iterations:

> start_time <- Sys.time(); draws <- mcmc(m, n_samples = 1000); end_time <- Sys.time()
    warmup ====================================   100/100 | eta:  0s          
  sampling ==================================== 1000/1000 | eta:  0s          
> end_time - start_time
Time difference of 54.61541 secs

Looking at the output, it’s indeed 1000 draws from a single chain.

Now let’s compare Stan’s version of the same model:

data {
  int<lower = 0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
joint <- stan_model("~/temp2/petals.stan")

Ouch, this takes like 45s.

N <- length(x)
posterior <- sampling(joint, data = list(N = N, x = x, y = y))

This takes less than a second to actually do the sampling with four chains of 1000 warmup iterations and 1000 sampling iterations.

Maybe I was supposed to do something else to make Greta fast like buy a computer with a real GPU or dozens of cores? This is on my now ancient quad-core i7 MacBook Pro.

Oh, and here’s the posterior Stan got for the slope, intercept, and error scale.

       mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
alpha  4.31    0.00 0.08  4.15  4.26  4.31  4.36  4.46  1724    1
beta   0.41    0.00 0.02  0.37  0.40  0.41  0.42  0.45  1568    1
sigma  0.41    0.00 0.02  0.37  0.39  0.41  0.42  0.46  1903    1