How to code a leave-one-out NW estimator in a loop?

Hello Everyone,

I would like to reproduce an algorithm that uses Bayesian methods to select the bandwidth of a local constant estimator in the context of nonparametric regression. The basic idea is simply to use a multivariate Gaussian kernel of the following form:

K_H(x) = (2\pi)^{-p/2} |H|^{-1/2} \exp\left(-\frac{1}{2} x^\top H^{-1} x\right)

where x is a p-dimensional vector and H is a p*p bandwidth matrix; for simplicity, just assume it is diagonal.
The leave-one-out NW estimator is simply:

\hat{m}_{-i}(x_i) = \frac{\sum_{j \neq i} K_H(x_i - x_j)\, y_j}{\sum_{j \neq i} K_H(x_i - x_j)}
If we assume y_i \sim \text{Normal}(\hat{m}_{-i}(x_i), \sigma^2), then we can actually write down the likelihood, and if we treat h_1, \ldots, h_p as unknown parameters we can get their point estimates by maximizing the likelihood. Now I just want to implement this in a Bayesian framework, so I wrote the following Stan code:
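For reference, the estimator I am trying to express is easy to write down outside Stan. Here is a plain NumPy sketch (function name is mine) of the leave-one-out NW estimates with a diagonal bandwidth matrix, where `bandwidth` holds the diagonal entries of H:

```python
import numpy as np

def loo_nw_estimates(X, y, bandwidth):
    """Leave-one-out Nadaraya-Watson estimates:
    m_hat[i] = sum_{j != i} K_H(x_i - x_j) y_j / sum_{j != i} K_H(x_i - x_j),
    with a Gaussian kernel and diagonal bandwidth matrix H = diag(bandwidth)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    h = np.asarray(bandwidth, dtype=float)
    n, p = X.shape
    # pairwise quadratic forms (x_i - x_j)' H^{-1} (x_i - x_j);
    # H is diagonal, so the inverse is just an elementwise division
    diff = X[:, None, :] - X[None, :, :]           # shape (n, n, p)
    quad = np.sum(diff**2 / h, axis=2)
    K = (2 * np.pi) ** (-p / 2) / np.sqrt(np.prod(h)) * np.exp(-0.5 * quad)
    np.fill_diagonal(K, 0.0)                       # "leave one out": drop the j == i term
    return (K @ y) / K.sum(axis=1)
```

Maximizing the likelihood over the bandwidths then corresponds to treating each y_i as Normal with mean `m_hat[i]`.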

data {
  int n_sample;                   // number of patients
  int p;                          // number of dimensions
  matrix[n_sample, p] X;          // covariate matrix
  real Y[n_sample];               // vector of responses
  real Yf[n_sample];              // same as Y but for the loop
}

parameters {
  vector[p] bandwidth;            // bandwidth
  real ysd;                       // sd for y, assuming the sd is the same between treatments, i.e. homogeneity of variance
}

transformed parameters {
  matrix[p, p] D;
  D = diag_matrix(bandwidth);
}


model {
  real m_hat[n_sample];           // leave-one-out estimates
  matrix[n_sample, n_sample] density;
  matrix[n_sample, n_sample] reg;
  real denominator[n_sample];
  real numerator[n_sample];

  for (j in 1:p) bandwidth[j] ~ uniform(0.0001, 10000);  // prior on the bandwidths
  1/ysd ~ inv_gamma(1, 1);        // prior for the variance
  for (i in 1:n_sample) {
    for (j in 1:n_sample) {
      density[i,j] = (2 * 3.14)^(-p/2) * sqrt(1/prod(bandwidth))
                     * exp((-1/2) * ((X[i,] - X[j,]) * inverse(D) * (X[i,] - X[j,])'));
      reg[i,j] = density[i,j] * Yf[j];
    }
    denominator[i] = sum(density[i,]) - density[i,i];
    numerator[i] = sum(reg[i,]) - reg[i,i];
    m_hat[i] = numerator[i] / denominator[i];
    Y[i] ~ normal(m_hat[i], ysd);
  }
}

However, the issue is that I keep getting sampling rejections and the chain just gets stuck there for hours… I think I got something wrong in the loop in the “model” block where I code the leave-one-out estimator. Can anyone give me some ideas about where I went wrong? Thanks!!


There are a couple of weird things in your Stan code:

You probably want to declare ysd as a positive number (to let it be used as standard deviation), i.e. real<lower=0> ysd;

When you use uniform priors, you need to define appropriate lower and upper bounds for your parameters. However, we mostly discourage uniform priors, as the hard bounds can pose problems in sampling. The best practice is to reserve hard bounds for bounds that are physical/mathematical, i.e. you certainly want bandwidth to have a lower bound of zero, but the upper bound should be soft (i.e. a suitable prior that puts low mass above what you consider a reasonable boundary, but does not introduce a hard threshold). Additionally, uniform priors are usually not very defensible - I guess you could put a prior on your bandwidth that would say something like “I am quite certain the bandwidth is not smaller than the distance between the two closest points and not larger than the span of the whole dataset” (e.g. a suitably chosen inverse gamma prior could work well here).
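The two anchors suggested above are cheap to compute from the data up front. A small NumPy sketch (function name is mine) of the quantities that could guide the soft range such a prior should cover:

```python
import numpy as np
from itertools import combinations

def bandwidth_prior_anchors(X):
    """Anchors for a soft bandwidth prior: the smallest nonzero pairwise
    distance (lower anchor) and the per-dimension span of the data (upper anchor)."""
    X = np.asarray(X, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)
    # smallest nonzero Euclidean distance between any two data points
    dists = [np.linalg.norm(a - b) for a, b in combinations(X, 2)]
    min_gap = min(d for d in dists if d > 0)
    return min_gap, span

min_gap, span = bandwidth_prior_anchors([[0.0, 0.0], [0.5, 0.0], [3.0, 4.0]])
```

The inverse gamma's shape/scale would then be tuned so that most of its mass falls between these two anchors; the tuning itself is left to the modeler.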

The absence of bounds could easily explain the initialization problems.

Hope that helps at least a little!

Thanks, Martin!

I redefined ‘ysd’ and the prior as suggested, but the sampling still seems stuck: it has shown only the following for hours:

SAMPLING FOR MODEL 'adt_bandwidth' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 22.1072 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 221072 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 3000 [ 0%] (Warmup)

Any suggestions for this? Thanks!

I think this is key. It appears your dataset is quite large and computing even a single leapfrog step takes a very, very long time. What is the n_sample you are working with? If, on top of that, the geometry of the posterior is unfriendly, you may need a large treedepth, which means computing potentially hundreds of leapfrog steps per iteration. A bit of arithmetic shows that this is going to take a lot of time.

Before you spend a lot of time making the model more efficient, I would try running it with just a small, simple (preferably simulated) dataset to see if it at least roughly works there and fulfills your needs. If yes, we can discuss some options to improve the speed.

Best of luck with your model!

Thanks, Martin! The sample size is only 2000 for this test; I reduced it to 200 and it still takes about 3 hours to finish. After I changed the uniform distribution to a gamma distribution, the sampling got easier, but it seems the data does not change the posterior at all (I plotted the posterior and it looks quite similar to the prior I gave)… So I guess there must be something wrong with my script?

Note that you are running the inner loop n_sample^2 times, so with 200 data points you are still evaluating it 40,000 times. Moving anything out of the inner loop would definitely help: you can precompute the constant factor (2 * 3.14)^(-p/2) * sqrt(1/prod(bandwidth)) and inverse(D) outside the loops. I would expect that to dramatically improve the computation time. (Also, inverse(D) should IMHO be equivalent to diag_matrix(inv(bandwidth)), which should also be substantially faster.)
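To illustrate the hoisting in ordinary code (a NumPy sketch of the same arithmetic, not the Stan model itself): both functions below compute the same kernel matrix, but the second evaluates the normalizing constant and the diagonal inverse once instead of n^2 times.

```python
import numpy as np

def kernel_matrix_naive(X, h):
    """Direct translation of the double loop: the normalizing constant and
    the inverse bandwidth matrix are recomputed on every pass."""
    n, p = X.shape
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            const = (2 * np.pi) ** (-p / 2) / np.sqrt(np.prod(h))  # recomputed n^2 times
            Dinv = np.diag(1.0 / np.asarray(h, dtype=float))       # recomputed n^2 times
            d = X[i] - X[j]
            K[i, j] = const * np.exp(-0.5 * d @ Dinv @ d)
    return K

def kernel_matrix_hoisted(X, h):
    """Same result, with the constant and the diagonal inverse hoisted out."""
    n, p = X.shape
    h = np.asarray(h, dtype=float)
    const = (2 * np.pi) ** (-p / 2) / np.sqrt(np.prod(h))  # computed once
    inv_h = 1.0 / h                                        # diagonal inverse, computed once
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            d = X[i] - X[j]
            K[i, j] = const * np.exp(-0.5 * np.sum(d * d * inv_h))
    return K
```

In the Stan model, the two hoisted quantities would simply become local variables declared before the loops.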

Additionally, 1/ysd ~ inv_gamma(1, 1); is probably not doing what you want as you are putting a prior on a derived quantity and would need a log-Jacobian adjustment (See 22.3 Changes of variables | Stan User’s Guide for more background). But in this case, it should be equivalent to ysd ~ gamma(1, 1); which is completely OK.

Not necessarily - maybe m_hat comes out almost the same regardless of the choice of bandwidth? Or maybe some bandwidths fit badly but can compensate by being associated with a larger ysd, once again giving them a likelihood similar to better-fitting bandwidths? I would probably try a very small dataset (e.g. 5 - 10 data points) where you can easily play with the bandwidth by hand and get a feel for what “should” happen with different bandwidths, and then see whether this really happens in the model.

Additionally, there might be numerical issues with the density underflowing to zero. Working with log densities and then using log_sum_exp / log_diff_exp instead of sums and subtractions could resolve them (but maybe they are not a problem here? Once again, you would need to investigate - maybe put some print statements here to see how frequently you get exact zeros…?)
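A sketch of what the log-scale version of the estimator could look like (NumPy, with a hand-rolled log-sum-exp; it assumes y > 0 so the numerator's log is defined, a simplification of this sketch only - in Stan, the built-in log_sum_exp would play the same role):

```python
import numpy as np

def _logsumexp(a, axis):
    """log(sum(exp(a))) along an axis without overflow/underflow in exp()."""
    m = np.max(a, axis=axis, keepdims=True)
    return (m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))).squeeze(axis=axis)

def loo_nw_log_scale(X, y, bandwidth):
    """Leave-one-out NW estimates computed on the log scale, so kernel weights
    that would underflow to exactly zero on the linear scale remain usable.
    Assumes y > 0 so that log(y) is defined."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    h = np.asarray(bandwidth, dtype=float)
    n, p = X.shape
    diff = X[:, None, :] - X[None, :, :]
    log_K = -0.5 * (p * np.log(2 * np.pi) + np.sum(np.log(h))
                    + np.sum(diff**2 / h, axis=2))
    np.fill_diagonal(log_K, -np.inf)   # leave one out: exp(-inf) = 0 weight for j == i
    log_num = _logsumexp(log_K + np.log(y)[None, :], axis=1)
    log_den = _logsumexp(log_K, axis=1)
    return np.exp(log_num - log_den)
```

With a very small bandwidth the individual weights underflow on the linear scale (giving 0/0), while the log-scale version still returns finite estimates.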

Hello Martin,

Thanks a lot! I think I know where I went wrong!
