Linear Model with high dimensional parameter space

Hi!
I’m trying to run a linear model with a high number of parameters (120) on a simulated data set of 100 individuals.


data {
  int<lower=0> N;   // number of data points
  vector[N] y;      // response
  int<lower=0> p;   // number of regressors
  matrix[N, p] X;   // matrix of covariates

  // prior hyperparameters
  real c;
  vector[p] b0;

  // prior parameters of the inverse-gamma on sigma2
  real sigma0;
  real nu0;
}


parameters {
  vector[p] beta;           // regression coefficients (column vector)
  real<lower=0> sigma2;     // variance
  vector<lower=0>[p] tau;   // prior scales for beta
  corr_matrix[p] Omega;     // prior correlation matrix for beta
}

transformed parameters {
  vector[N] mu;   // mean
  for (i in 1:N) {
    mu[i] = X[i, ] * beta;
  }
}

model {
  // likelihood:
  y ~ normal(mu, sqrt(sigma2));

  // priors:
  Omega ~ lkj_corr(2);
  beta ~ multi_normal(b0, quad_form_diag(Omega, tau));
  tau ~ cauchy(0, 2.5);
  sigma2 ~ inv_gamma(nu0 / 2, nu0 * sigma0 / 2);
}

Unfortunately, it takes a really long time to run; in fact, it never seems to move from the initial state.

Can you help me understand why I have such performance problems?

Thanks in advance!

LM.R (3.1 KB) LM_120.stan (860 Bytes)

You can speed things up by using a Cholesky-factored multinormal.
Good priors are important in high-dimensional models. So-called noninformative priors are not very good; in particular, I doubt that Cauchy is going to work.

Here’s an example of how the model could be rewritten.

data {
  ...
}
parameters {
  vector[p] beta;                    // regression coefficients (column vector)
  real<lower=0> sigma;               // standard deviation
  vector<lower=0>[p] tau;            // prior scales for beta
  cholesky_factor_corr[p] L_Omega;   // Cholesky factor of the prior correlation matrix
}
transformed parameters {
  vector[N] mu = X * beta;
}
model {
  y ~ normal(mu, sigma);
  
  L_Omega ~ lkj_corr_cholesky(2); 
  beta ~ multi_normal_cholesky(b0, diag_pre_multiply(tau, L_Omega));

  tau ~ exponential(1);
  sigma ~ exponential(1);
}
generated quantities {
  real sigma2 = sigma^2;
  matrix[p,p] Omega = tcrossprod(L_Omega);
}

Are you assuming most covariates relevant and correlating, or only a small number of covariates relevant (and either independent or correlating)?

This would be much faster with

model {
  ...
  y ~ normal_id_glm(X, 0.0, beta, sigma);
}

I want to test Stan’s performance when I have a higher number of parameters than data points, so I’m considering all the parameters relevant.

What do you mean by parameters? In Stan, parameters are defined in the parameters block, and the number of parameters can be different from the number of covariates. I asked what your assumptions on the covariates are. If you want a lot of relevant parameters, then, for example, the number of parameters with a regularized horseshoe prior is more than double the number of covariates. If you want even more parameters, you can use an alternative regularized horseshoe parameterization for which the number of parameters is more than triple the number of covariates, and all parameters are relevant.
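
For concreteness, here is a minimal sketch of what a regularized horseshoe prior can look like in Stan. This is not tuned to your problem: the hyperparameters scale_global, nu_global, nu_local, slab_scale and slab_df are assumed to be supplied through the data block, and the intercept column of X gets the same shrinkage prior as the other coefficients, which you may not want. Note that beta is now a transformed parameter; the actual parameters are z, lambda, tau, caux and sigma, i.e. more than twice as many as there are covariates.

data {
  int<lower=0> N;
  int<lower=0> p;
  matrix[N, p] X;
  vector[N] y;
  // horseshoe hyperparameters (assumed to be chosen by the user)
  real<lower=0> scale_global;  // scale of the half-t prior on tau
  real<lower=1> nu_global;     // df of the half-t prior on tau
  real<lower=1> nu_local;      // df of the half-t priors on lambda
  real<lower=0> slab_scale;    // slab scale
  real<lower=0> slab_df;       // slab degrees of freedom
}
parameters {
  vector[p] z;                 // standardized coefficients
  real<lower=0> tau;           // global shrinkage
  vector<lower=0>[p] lambda;   // local shrinkage
  real<lower=0> caux;          // auxiliary variable for the slab
  real<lower=0> sigma;         // noise standard deviation
}
transformed parameters {
  real<lower=0> c_slab = slab_scale * sqrt(caux);   // slab width
  vector<lower=0>[p] lambda_tilde
    = sqrt(c_slab^2 * square(lambda) ./ (c_slab^2 + tau^2 * square(lambda)));
  vector[p] beta = z .* lambda_tilde * tau;         // regression coefficients
}
model {
  z ~ std_normal();
  lambda ~ student_t(nu_local, 0, 1);
  tau ~ student_t(nu_global, 0, scale_global * sigma);
  caux ~ inv_gamma(0.5 * slab_df, 0.5 * slab_df);
  sigma ~ exponential(1);
  y ~ normal_id_glm(X, 0.0, beta, sigma);
}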

What do you mean by performance? Performance as in 1) computation time, 2) effective sample size per second, 3) predictive performance, or 4) accuracy of recovering the true parameter values (you said you have simulated data)?

Sorry, by parameters I was referring to the regression coefficients, namely the betas, whose number coincides with the number of covariates, including the intercept.

I’m testing performance mainly by looking at the ratio of effective sample size to running time.

I’m now running the model with the changes that you suggested and it is much faster.

Anyway, I don’t think I got what you are saying about the regularized horseshoe prior. Is it a suggestion to speed things up even more?

The prior discussion is relevant to whether you are testing effective sample size per second for a sensible model. I was trying to figure out what type of prior to recommend so that the modeling results would also be sensible. It’s easier to get high effective sample size per second for a model where you choose the model structure and priors based on computational aspects. In this case, if you just want more speed, then drop the Cholesky things and use a std_normal() prior for beta. You get further speed by using a prior which is finite when going towards zero, e.g. half-normal for tau and sigma, and you are likely to get an even bigger speed-up by using zero-avoiding priors like log-normal or gamma with specific parameters. The more Gaussian you can make the posterior, the faster the sampling will be.
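
A minimal sketch of that simplified (centered, no-Cholesky) version is below. It keeps the per-coefficient scales tau; the simplest variant would just be beta ~ std_normal() with tau dropped entirely, and the lognormal(0, 1) scales here are placeholders rather than tuned recommendations.

data {
  int<lower=0> N;
  int<lower=0> p;
  matrix[N, p] X;
  vector[N] y;
  vector[p] b0;            // prior means for beta
}
parameters {
  vector[p] beta;          // regression coefficients
  vector<lower=0>[p] tau;  // per-coefficient prior scales
  real<lower=0> sigma;     // noise standard deviation
}
model {
  y ~ normal_id_glm(X, 0.0, beta, sigma);
  beta ~ normal(b0, tau);  // independent priors, no correlation matrix
  tau ~ lognormal(0, 1);   // zero-avoiding; half-normal would also work
  sigma ~ lognormal(0, 1); // placeholder scale
}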

One more thing that can give even better ESS/s is to use a non-centered parameterization.

Thanks for your advice!

Now I’m running the following model:

parameters {
  vector[p] beta_raw;       // raw regression coefficients (column vector)
  real<lower=0> sigma;  // standard deviation
  real<lower=0> sigma_beta;
  real mu_beta;
 
}
transformed parameters {
  vector[p] beta;
  // implies: beta ~ normal(mu_beta, sigma_beta)
  beta = mu_beta + sigma_beta * beta_raw;
}
model {
 y ~ normal_id_glm(X, 0.0, beta, sigma);
  
  beta_raw ~ std_normal();

  sigma ~ lognormal(0,0.5);
}
generated quantities {
  real sigma2 = sigma^2;
}

with 500000 iterations and 250000 of burn-in, 2 chains.
It is faster (11824.65 sec elapsed), but the effective sample size is still quite low and there are some divergent transitions.

Any other suggestions besides running the model for more iterations and maybe increasing adapt_delta above 0.8?

I’m really surprised that the ESS would be low. Can you tell me what you mean by low ESS? Can you tell me n_leapfrog per iteration? Is your data centered?

The lowest effective sample size is 275 (total post-warmup draws are 50000).
The covariates are simulated from a N(0,1) and the responses y are simulated from a N(X^T \beta, 5) and then centred so that their mean, in the end, is almost zero.

About n_leapfrog, I didn’t check how many there are at each iteration, but I got the following warning message:

 There were 46527 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10.

That is ridiculously low for this problem.

I now noticed you had earlier posted an R file, which has very different code for simulating X, and that would affect the results a lot. Are your latest results with different code? You write N(…, 5), but simulate from N(…, sqrt(5)). Is your beta still beta <- seq(1,K,1)? The largest beta is 1) 120 times larger than the smallest beta, 2) 120 times larger than the prior sigma, and 3) 54 times bigger than the noise sigma, which means that 1+2) you have a big prior vs. data conflict that is not solved just by making the prior wider, and 3) the likelihood is very informative and the non-centered parameterization is not that good (when recommending it I didn’t realize how the data was simulated). If the data simulation matches your real problem well, then I can give new recommendations for the prior and model.

You can also do as recommended. Also, it would be better to run 4 chains; you can probably run shorter chains.

Yes, I’ve changed my data. What I have now is:

N <- 100
K <- 120

# true regression parameters
beta <- seq(1, K, 1)  # [1 2 3 4 …]

# matrix X of covariates
X <- matrix(nrow = N, ncol = K)
X[, 1] <- rep(1, N)  # intercept
for (i in 2:K) {
  X[, i] <- rnorm(n = N, mean = 0, sd = 1)  # covariates
}

# Y: response
Y <- rep(0, N)
for (i in 1:N) {
  Y[i] <- rnorm(n = 1, mean = X[i, ] %*% beta, sd = 5^0.5)
}

data <- Y - mean(Y)  # data passed to the Stan model

Should I also center the covariates after simulating them?
If the non-centered parameterization is not that good, what do you suggest?

Is beta still seq(1,K)?

I’ll run some tests with your simulated data and report back.

Yes, it is.
But this is not actually important to me. I chose these values just for simplicity; I didn’t think about how they would impact the choice of the prior. If it would make the code faster, I can also change them.

The latest model does not have a prior for sigma_beta. With only 100 data points arbitrarily large betas can be consistent with the data as long as they are correlated just right. The implicit flat prior puts most probability on these huge, correlated values and therefore makes efficient sampling impossible.
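
For example, the model block of that latest model could be amended along these lines (the normal(0, 10) scales are purely illustrative, and a prior for mu_beta is added for the same reason):

model {
  y ~ normal_id_glm(X, 0.0, beta, sigma);

  beta_raw ~ std_normal();
  mu_beta ~ normal(0, 10);     // illustrative scale
  sigma_beta ~ normal(0, 10);  // half-normal, since sigma_beta is declared <lower=0>
  sigma ~ lognormal(0, 0.5);
}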

Good catch, I should have noticed that, too. This is still challenging with that added.

Tested. Adding that prior makes it work fast, and non-centered works better. I get average Bulk-ESS > 6000 and average Tail-ESS > 2700 when running 4 chains with 1000+1000 iterations. On my laptop it takes about 3 s.

As you subtract just the mean from Y but don’t do anything else, sd(Y) = 892, and thus the noise sd = sqrt(5) is really, really small compared to that. The question is: do you assume that the ratio of the variance of Y to the noise variance is 159202? If you scaled Y to have sd(Y) = 1, your noise sd would become 0.0025, which is super small noise. Is this similar to the real noise you would expect to see?
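
To spell out the arithmetic (using the rounded sd(Y) \approx 892):

\mathrm{Var}(Y) / \sigma^2 \approx 892^2 / 5 \approx 1.6 \times 10^5, \qquad \sqrt{5} / 892 \approx 0.0025.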

My final aim is just to compare the effective sample size per second reached with Stan and JAGS for the same model. So I can assume super small noise as long as I do the same for the model in JAGS; that is fine for my objective.

However, when scaling Y my code runs faster, but the estimates are really far from the true values of the betas.