Multiple Regression in RStan with factors

I am a little confused about how to change the following code so that it returns a coefficient for each level of a factor, rather than treating the factor as a single numeric vector.

I have the following model

{stan, output.var=multi_regression}
data {
  int<lower=0> N;       // number of data items
  int<lower=0> K;       // number of predictors
  matrix[N, K] x;       // predictor matrix
  vector[N] y;          // outcome vector
}
parameters {
  real alpha;           // intercept
  vector[K] beta;       // coefficients for predictors
  real<lower=0> sigma;  // error scale
}
model {
  y ~ normal(x * beta + alpha, sigma); // likelihood
}

and the following data

{r}
library(tidyverse)
library(rstan)
data(diamonds)

y <- select(diamonds,price) 
x <- select(diamonds,carat,cut) 
N <- nrow(diamonds) 
K <- 2

#try to speed things up  
rstan_options(auto_write = FALSE)
options(mc.cores = 4)

#multi_regression model
multi_reg_model <- sampling(multi_regression,
                      data=list(y=y$price,x=x,N=N,K=K))

rstan::summary(multi_reg_model)$summary

When I run

lm(price~carat+cut,data=diamonds)

I get

Coefficients:
(Intercept)        carat        cut.L        cut.Q        cut.C  
   -2701.38      7871.08      1239.80      -528.60       367.91  
      cut^4  
      74.59

but my Stan code returns

              mean
alpha       -3328.94
beta[1]      7838.09
beta[2]      257.99
sigma        1522.20

Clearly I am treating cut as a single numeric column rather than modeling each of its factor levels. What am I doing wrong?


You should do

mf <- model.frame(price ~ carat + cut, data = diamonds)
y <- mf$price
x <- model.matrix(price ~ carat + cut, data = mf)[, -1] # drop the intercept column; your Stan model already has alpha

or at least that is basically what the rstanarm and brms R packages do when fitting models like this. If you want to get rid of those polynomial contrasts (the cut.L, cut.Q, ... columns R creates for ordered factors), you can first coerce cut to an unordered factor or call

options(contrasts = c(unordered = 'contr.treatment',
                      ordered = 'contr.treatment'))
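
If it helps, a quick way to see which coding you are getting is to look at the column names of the model matrix. Roughly like this (output shown as comments, assuming the diamonds data loaded above):

colnames(model.matrix(price ~ carat + cut, data = diamonds))
# default polynomial contrasts for the ordered factor cut:
# "(Intercept)" "carat" "cut.L" "cut.Q" "cut.C" "cut^4"

options(contrasts = c(unordered = 'contr.treatment',
                      ordered = 'contr.treatment'))
colnames(model.matrix(price ~ carat + cut, data = diamonds))
# treatment contrasts, one column per non-reference level of cut:
# "(Intercept)" "carat" "cutGood" "cutVery Good" "cutPremium" "cutIdeal"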

Thanks, with your suggestion I came up with the following, which seems to work

{r}
# Prepare data to replicate the following
# lm(price~carat + cut, data=diamonds)


library(tidyverse)
library(rstan)
data("diamonds")
mf <- model.frame(price ~ carat + cut, data = diamonds) #needed for factors
y <- mf$price  
x <- model.matrix(price ~ carat + cut, data = mf)[, -1] # also needed for factors; drop the intercept column since the model has alpha
N <- nrow(diamonds) 
K <- ncol(x)  # number of predictor columns (remember that factors expand into several contrast columns)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
multi_reg <- sampling(multi_regression,
                      data=list(y=y,x=x,N=N,K=K))

rstan::summary(multi_reg)$summary

One thing I am surprised about is how long this takes to run. I am new to Bayesian computing, but this fit takes over 30 minutes. Is that a typical wait time?

I received the following output

1000 transitions using 10 leapfrog steps per transition would take 46.7 seconds.

That Stan output is basically a lie, so don't pay any attention to it. But since that dataset has 50,000+ observations and you coded the likelihood without using sufficient statistics, it can take a while for NUTS to finish.

@bgoodri Thanks. I was not aware that I could use sufficient statistics to speed up the process. Where can I learn more about this?

It is in the appendix to chapter 3 of Tony Lancaster’s textbook and implemented in


No, it is an exactly true statement under the clearly stated conditions. In problems that require many more leapfrog steps per transition, the runtime will be correspondingly larger.

That is why I said it is basically a lie: technically true, but completely misleading about what people assume it is for, which is projecting the runtime. We should get rid of it.

@bgoodri Thanks again for the help. I am having difficulty following the code you linked to on GitHub. As a simple example, what changes could I make to my current multiple regression model to speed it up? I think if I were able to see that, I would be able to apply the same pattern to other problems.

This only works for linear, no-link, Gaussian models, but take this modified function to evaluate the log-likelihood:

functions {
  /**
   * Increments the log-posterior with the logarithm of a multivariate normal 
   * likelihood with a scalar standard deviation for all errors
   * Equivalent to normal_lpdf(y | intercept + X * theta, sigma) but faster
   * @param theta vector of coefficients (excluding intercept)
   * @param XpX matrix equal to X'X (excluding intercept) with columns of X centered
   * @param b precomputed vector of OLS coefficients (excluding intercept)
   * @param intercept scalar (assuming the columns of X have mean zero)
   * @param ybar precomputed sample mean of the outcome
   * @param SSR positive precomputed value of the sum of squared OLS residuals
   * @param sigma positive scalar for the standard deviation of the errors
   * @param N integer equal to the number of observations
   */
  void ll_mvn_ols_lp(vector theta, matrix XpX, vector b,
                     real intercept, real ybar,
                     real SSR, real sigma, int N) {
    target += -0.5 * (quad_form(XpX, theta - b) + 
      N * square(intercept - ybar) + SSR) / 
      square(sigma) - N * (log(sigma) + log(sqrt(2 * pi())));
  }
}

You are going to need to split the intercept out of X, center the columns of X, calculate the OLS solution, the sample average of the outcome, the sum of squared residuals, etc. Also, calculate XpX as t(X) %*% X in R. You can pass all that to the function and it will use much less autodiff to calculate the derivatives of the likelihood function.
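
To make that concrete, here is a rough sketch of the R-side precomputation (the names Xc, b, SSR, and standata are just illustrative, and it assumes a Stan program whose model block calls ll_mvn_ols_lp(beta, XpX, b, alpha, ybar, SSR, sigma, N)):

X <- model.matrix(price ~ carat + cut, data = diamonds)[, -1]  # intercept split out
y <- diamonds$price
N <- nrow(X)

Xc   <- scale(X, center = TRUE, scale = FALSE)       # center the columns of X
ybar <- mean(y)
XpX  <- crossprod(Xc)                                # t(Xc) %*% Xc
b    <- solve(XpX, crossprod(Xc, y - ybar))          # OLS coefficients on the centered columns
SSR  <- sum((y - ybar - Xc %*% b)^2)                 # sum of squared OLS residuals

standata <- list(N = N, K = ncol(Xc), XpX = XpX, b = as.vector(b),
                 ybar = ybar, SSR = SSR)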

The QR transformation is talked about in the manual and in one of Mike’s case studies. It will usually help and applies to any GLM.
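
For what it is worth, the data-side part of the QR trick can look roughly like this (following the scaled thin-QR reparameterization described in the manual; the names Q_ast, R_ast, and theta are just the usual conventions, and the corresponding Stan model is not shown here):

qr_x  <- qr(X)                       # X without the intercept column, as above
Q_ast <- qr.Q(qr_x) * sqrt(N - 1)    # pass this to Stan in place of X
R_ast <- qr.R(qr_x) / sqrt(N - 1)

# in Stan you would sample theta with y ~ normal(Q_ast * theta + alpha, sigma),
# then recover the coefficients on the original scale with
# beta <- solve(R_ast, theta)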

But really in this case, the model of diamond price does not fit that well, which is responsible for a lot of the slowdown. It is better to model log-price additively.
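
For example, something along these lines (purely illustrative, not a tuned model):

lm(log(price) ~ carat + cut, data = diamonds)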


Thanks again. All of your responses have been very helpful.

Maybe in situations like this, the following

fit <- MCMCpack::MCMCregress(price ~ carat + cut, data = diamonds)
summary(fit)

seems like a much better option in terms of saving programmer-time and computer-time.

I honestly did not expect it to take this much coding to get such a common type of regression to run in a sensible amount of computer time.

I am guessing that rstanarm is also a great fit for this kind of problem.

MCMCpack::MCMCregress uses essentially the same tricks to pre-compute matrices of cross-products as rstanarm::stan_lm does. But NUTS is generally a better way to do MCMC than Gibbs sampling, and it is always better than Metropolis-Hastings. If you compare MCMCpack::MCMClogit to either rstanarm::stan_glm or brms::brm with binomial link functions, there is a big difference in sampling efficiency.
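
For comparison, a minimal rstanarm call for this model could look something like the following (the R2 prior location of 0.85 is purely illustrative, not a value anyone in this thread has recommended):

library(rstanarm)

# stan_lm pre-computes the same sufficient statistics internally,
# so it handles the full diamonds data without a hand-written likelihood
fit <- stan_lm(price ~ carat + cut, data = diamonds,
               prior = R2(location = 0.85, what = "mode"),
               chains = 4, cores = 4)
summary(fit)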


That is good to know. Thanks for introducing me to the world of Stan and Bayesian computation.