I am able to run a latent variable Gaussian process model successfully on several data sets (A or B) separately, or on their combination (A+B). However, when I change the variance model from y \sim normal(f, \sigma) to y \sim normal(f, \sigma_1 + \sigma_2*y*(1-y)), I am able to run the model on A but not on B. Is there a good way to find out why?
In this case f \sim multivariate\_normal(0, K(X \mid \alpha, \rho)), where X is the design matrix (inputs), K is a squared exponential kernel, and \alpha, \rho are its parameters. I am using the Cholesky decomposition.
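Schematically, the Stan code looks like this (the priors, the jitter on the kernel diagonal, and the non-centred parameterisation below are simplified placeholders, not my actual file):

```stan
// Simplified sketch of the latent-variable GP with the modified noise model.
// Priors, the 1e-9 jitter, and the non-centred parameterisation are
// placeholder choices for illustration only.
data {
  int<lower=1> N;
  int<lower=1> D;
  array[N] vector[D] X;   // rows of the design matrix used as GP inputs
  vector[N] y;
}
parameters {
  real<lower=0> alpha;    // marginal standard deviation of the GP
  real<lower=0> rho;      // length scale
  real<lower=0> sigma1;
  real<lower=0> sigma2;
  vector[N] eta;          // non-centred latent variables
}
transformed parameters {
  vector[N] f;
  {
    matrix[N, N] K = gp_exp_quad_cov(X, alpha, rho)
                     + diag_matrix(rep_vector(1e-9, N));  // jitter for stability
    matrix[N, N] L_K = cholesky_decompose(K);
    f = L_K * eta;
  }
}
model {
  eta ~ std_normal();
  alpha ~ std_normal();       // placeholder priors
  rho ~ inv_gamma(5, 5);
  sigma1 ~ std_normal();
  sigma2 ~ std_normal();
  // noise scale sigma_1 + sigma_2 * y * (1 - y), evaluated elementwise
  y ~ normal(f, sigma1 + sigma2 * y .* (1 - y));
}
```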
I’d suggest starting by working out whether the model is divergent, unidentifiable, or multimodal.
The first may be flagged by divergent transitions and failure to converge in at least one chain.
The second may be flagged by chains which all converge, but on different values for different parameters. Which parameters the chains don’t agree on may be a useful clue.
The third … I’m not sure … in low-dimensional problems one can tell by drawing a graph of the parameters’ log density. I don’t know what to do when there is a tonne of parameters. It probably looks a lot like case two on the face of it.
The problem is that the model runs for data set A and gives good results. When I run the model with data set B or A+B, the sampler stops at 0%. I can save the warmup draws to a file, but how do I figure out from the warmup what is going on?
The warmup iterations should be convergent … HMC allegedly discovers the “typical set” pretty early on, so you should be able to see something about the direction in which each chain is going (or not going) from the warmups, so far as I know.
Is dataset B qualitatively different? Does it contain zeros or negative numbers or something that isn’t so abundant in A? It may be useful to consider how the parameters interact with the data. Say you halve dataset B; what happens? How small can you get dataset B and still reproduce the behaviour? If you can get the data down to something very small, you may be able to think about why it does what it does more easily.
There is a physical justification. The model with constant variance works (with darn good predictions) for A, B, and A+B, and with various factor combinations (columns of X). Thus I would attribute this to data set B being incompatible with the non-constant-variance model. I think that cmdstan just hung on me. I would like to find out why.
Looking at the warmup helped. The variance model I was trying to implement was \sigma_1 - \sigma_2*y*(y - 2*y_{max}): physics tells us that the variance is lowest at the edges of [0, 1] and highest at y_{max}. The sampler stopped when the draw of \sigma_1 became 1e-5. However, the mixing for \sigma_2 is rather good.
I think the culprit was how I implemented the variance model. Any suggestions on how to implement the variance “hump”?
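Completing the square makes the shape explicit:

\sigma_1 - \sigma_2*y*(y - 2*y_{max}) = \sigma_1 + \sigma_2*(y_{max}^2 - (y - y_{max})^2),

so the scale peaks at y = y_{max} with value \sigma_1 + \sigma_2*y_{max}^2 and falls back to \sigma_1 at y = 0. When a draw pushes \sigma_1 down to 1e-5, the scale near y = 0 collapses with it, which presumably is where the sampler got stuck.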
You could try artificially constraining sigma to be at least a small distance above zero, to see whether the 1e-5 is relevant or whether it happens anyway. Might be useful to share the graph?
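For example (fragments only, not a complete model; the 0.01 floor is just a number to experiment with, and y_max is assumed to be passed in as data):

```stan
parameters {
  real<lower=0.01> sigma1;   // hard floor; the 0.01 value is an arbitrary choice
  real<lower=0> sigma2;
}
model {
  // hump-shaped scale from your post; y_max assumed supplied in the data block
  vector[N] sigma_y = sigma1 - sigma2 * y .* (y - 2 * y_max);
  y ~ normal(f, sigma_y);
  // note: the floor on sigma_1 alone does not guarantee sigma_y > 0 for every y
}
```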
So y is Bernoulli distributed? If you write y*(1-y), then this is its variance, but you are fitting a normal distribution, where the scale parameter is the square root of the variance.
My 2 cents: the closest thing that would make sense here is to use a beta distribution.
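Something along these lines (kappa, its prior, and the inv_logit link are just placeholder choices, not your model):

```stan
// Sketch only: map the latent GP into (0, 1) and use the beta's
// mean/precision parameterisation, which builds in a mu*(1-mu)-shaped variance.
parameters {
  real<lower=0> kappa;               // precision; placeholder name and prior
}
model {
  vector[N] mu = inv_logit(f);       // f is the latent GP from earlier
  kappa ~ gamma(2, 0.1);             // placeholder prior
  y ~ beta_proportion(mu, kappa);    // Var[y] = mu * (1 - mu) / (1 + kappa)
  // requires 0 < y < 1 strictly: exact zeros or ones in data set B would break this
}
```

The mean/precision form gives Var[y] = mu*(1 - mu)/(1 + kappa), so the “hump” over [0, 1] comes for free, but it does require y to lie strictly between 0 and 1.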