Multiple Regression in RStan with factors

I am a little confused about how to change the following code so that it returns a coefficient for each level of a factor, rather than treating the factor as a single numeric vector.

I have the following model

{stan, output.var=multi_regression}
data {
  int<lower=0> N;       // number of data items
  int<lower=0> K;       // number of predictors
  matrix[N, K] x;       // predictor matrix
  vector[N] y;          // outcome vector
}
parameters {
  real alpha;           // intercept
  vector[K] beta;       // coefficients for predictors
  real<lower=0> sigma;  // error scale
}
model {
  y ~ normal(x * beta + alpha, sigma); // likelihood
}

and the following data

{r}
library(tidyverse)
library(rstan)
data(diamonds)

y <- select(diamonds,price) 
x <- select(diamonds,carat,cut) 
N <- nrow(diamonds) 
K <- 2

#try to speed things up  
rstan_options(auto_write = FALSE)
options(mc.cores = 4)

#multi_regression model
multi_reg_model <- sampling(multi_regression,
                      data=list(y=y$price,x=x,N=N,K=K))

rstan::summary(multi_reg_model)$summary

When I run

lm(price~carat+cut,data=diamonds)

I get

Coefficients:
(Intercept)        carat        cut.L        cut.Q        cut.C  
   -2701.38      7871.08      1239.80      -528.60       367.91  
      cut^4  
      74.59

but my Stan code returns

              mean
alpha       -3328.94
beta[1]      7838.09
beta[2]      257.99
sigma        1522.20

Clearly I am treating cut as a single numeric column rather than modeling each of its factor levels. What am I doing wrong?


You should do

mf <- model.frame(price ~ carat + cut, data = diamonds)
y <- mf$price
x <- model.matrix(price ~ carat + cut, data = mf)[, -1] # drop the intercept column; your Stan model already has alpha

or at least that is basically what the rstanarm and brms R packages do when fitting models like this. If you want to get rid of those polynomial contrasts (the cut.L, cut.Q, ... columns R creates for ordered factors), you can first coerce cut to an unordered factor or call

options(contrasts = c(unordered = 'contr.treatment',
                      ordered = 'contr.treatment'))
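
If it helps, a quick way to see which coding you are getting is to look at the column names of the model matrix. Roughly like this (output shown as comments, assuming the diamonds data loaded above):

colnames(model.matrix(price ~ carat + cut, data = diamonds))
# default polynomial contrasts for the ordered factor cut:
# "(Intercept)" "carat" "cut.L" "cut.Q" "cut.C" "cut^4"

options(contrasts = c(unordered = 'contr.treatment',
                      ordered = 'contr.treatment'))
colnames(model.matrix(price ~ carat + cut, data = diamonds))
# treatment contrasts, one column per non-reference level of cut:
# "(Intercept)" "carat" "cutGood" "cutVery Good" "cutPremium" "cutIdeal"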

Thanks, with your suggestion I came up with the following, which seems to work

{r}
# Prepare data to replicate the following
# lm(price~carat + cut, data=diamonds)


library(tidyverse)
library(rstan)
data("diamonds")
mf <- model.frame(price ~ carat + cut, data = diamonds) #needed for factors
y <- mf$price  
x <- model.matrix(price ~ carat + cut, data = mf)[, -1] # also needed for factors; drop the intercept column since the model has alpha
N <- nrow(diamonds) 
K <- ncol(x)  # number of predictor columns (remember that factors expand into several contrast columns)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
multi_reg <- sampling(multi_regression,
                      data=list(y=y,x=x,N=N,K=K))

rstan::summary(multi_reg)$summary

One thing I am surprised about is how long this takes to run. I am new to Bayesian computing, but this fit takes over 30 minutes. Is that a typical wait time?

I received the following output

1000 transitions using 10 leapfrog steps per transition would take 46.7 seconds.

That Stan output is basically a lie, so don't pay any attention to it. But since that dataset has 50,000+ observations and you coded the likelihood without using sufficient statistics, it can take a while for NUTS to finish.

@bgoodri Thanks. I was not aware that I could use sufficient statistics to speed up the process. Where can I learn more about this?

It is in the appendix to chapter 3 of Tony Lancaster’s textbook and implemented in


No, it is an exactly true statement under the clearly stated conditions. In problems that require many more leapfrog steps per transition, the runtime will be correspondingly larger.

That is why I said it is basically a lie: technically true, but completely misleading about what people assume it is for, which is projecting the runtime. We should get rid of it.

@bgoodri Thanks again for the help. I am having difficulty following the code you linked to on GitHub. As a simple example, what changes could I make to my current multiple regression model to speed it up? I think if I were able to see that, I would be able to apply the same pattern to other problems.

This only works for linear, no-link, Gaussian models, but take this modified function to evaluate the log-likelihood:

functions {
  /**
   * Increments the log-posterior with the logarithm of a multivariate normal 
   * likelihood with a scalar standard deviation for all errors
   * Equivalent to normal_lpdf(y | intercept + X * theta, sigma) but faster
   * @param theta vector of coefficients (excluding intercept)
   * @param XpX matrix equal to X'X (excluding intercept) with columns of X centered
   * @param b precomputed vector of OLS coefficients (excluding intercept)
   * @param intercept scalar (assuming the columns of X have mean zero)
   * @param ybar precomputed sample mean of the outcome
   * @param SSR positive precomputed value of the sum of squared OLS residuals
   * @param sigma positive scalar for the standard deviation of the errors
   * @param N integer equal to the number of observations
   */
  void ll_mvn_ols_lp(vector theta, matrix XpX, vector b,
                     real intercept, real ybar,
                     real SSR, real sigma, int N) {
    target += -0.5 * (quad_form(XpX, theta - b) + 
      N * square(intercept - ybar) + SSR) / 
      square(sigma) - N * (log(sigma) + log(sqrt(2 * pi())));
  }
}

You are going to need to split the intercept out of X, center the columns of X, calculate the OLS solution, the sample average of the outcome, the sum of squared residuals, etc. Also, calculate XpX as t(X) %*% X in R. You can pass all that to the function and it will use much less autodiff to calculate the derivatives of the likelihood function.
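
To make that concrete, here is a rough sketch of the R-side precomputation (the names Xc, b, SSR, and standata are just illustrative, and it assumes a Stan program whose model block calls ll_mvn_ols_lp(beta, XpX, b, alpha, ybar, SSR, sigma, N)):

X <- model.matrix(price ~ carat + cut, data = diamonds)[, -1]  # intercept split out
y <- diamonds$price
N <- nrow(X)

Xc   <- scale(X, center = TRUE, scale = FALSE)       # center the columns of X
ybar <- mean(y)
XpX  <- crossprod(Xc)                                # t(Xc) %*% Xc
b    <- solve(XpX, crossprod(Xc, y - ybar))          # OLS coefficients on the centered columns
SSR  <- sum((y - ybar - Xc %*% b)^2)                 # sum of squared OLS residuals

standata <- list(N = N, K = ncol(Xc), XpX = XpX, b = as.vector(b),
                 ybar = ybar, SSR = SSR)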

The QR transformation is talked about in the manual and in one of Mike’s case studies. It will usually help and applies to any GLM.
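
For what it is worth, the data-side part of the QR trick can look roughly like this (following the scaled thin-QR reparameterization described in the manual; the names Q_ast, R_ast, and theta are just the usual conventions, and the corresponding Stan model is not shown here):

qr_x  <- qr(X)                       # X without the intercept column, as above
Q_ast <- qr.Q(qr_x) * sqrt(N - 1)    # pass this to Stan in place of X
R_ast <- qr.R(qr_x) / sqrt(N - 1)

# in Stan you would sample theta with y ~ normal(Q_ast * theta + alpha, sigma),
# then recover the coefficients on the original scale with
# beta <- solve(R_ast, theta)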

But really in this case, the model of diamond price does not fit that well, which is responsible for a lot of the slowdown. It is better to model log-price additively.
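
For example, something along these lines (purely illustrative, not a tuned model):

lm(log(price) ~ carat + cut, data = diamonds)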


Thanks again. All of your responses have been very helpful.

Maybe in situations like this, the following

fit <- MCMCpack::MCMCregress(price ~ carat + cut, data = diamonds)
summary(fit)

seems like a much better option in terms of saving programmer-time and computer-time.

I honestly did not expect it to take this much coding to get such a common type of regression to run in a sensible amount of computer time.

I am guessing that rstanarm is also a great fit for this kind of problem.

MCMCpack::MCMCregress uses essentially the same tricks to pre-compute matrices of cross-products as rstanarm::stan_lm does. But NUTS is generally a better way to do MCMC than Gibbs sampling, and it is always better than Metropolis-Hastings. If you compare MCMCpack::MCMClogit to either rstanarm::stan_glm or brms::brm with binomial link functions, there is a big difference in sampling efficiency.
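
For comparison, a minimal rstanarm call for this model could look something like the following (the R2 prior location of 0.85 is purely illustrative, not a value anyone in this thread has recommended):

library(rstanarm)

# stan_lm pre-computes the same sufficient statistics internally,
# so it handles the full diamonds data without a hand-written likelihood
fit <- stan_lm(price ~ carat + cut, data = diamonds,
               prior = R2(location = 0.85, what = "mode"),
               chains = 4, cores = 4)
summary(fit)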


That is good to know. Thanks for introducing me to the world of Stan and Bayesian computation.