Hi there,
I’m having a problem getting the modes of my posterior probability distributions lining up to what I’d expect them to be, given the data I am learning from. I am using the package cmdstanpy.
I have a generated dataset shown in the x/y plot below:
Here, y is generated according to y = 2x + 1, with some additional Gaussian noise.
x is generated using normal distributions with ‘known’ means and variances. The screenshot of python code below describes how this is generated:
For clarity, wn_calc is equivalent to y, and stiff_calc is equivalent to x.
With my hierarchical model (written in Stan), I am trying to relearn the ‘known’ parameters of the distributions from the data generation process, given the observations, and prior distributions over these ‘known’ parameters. The code for the Stan model is as below.
data {
int<lower=0> N; // number of data
vector[N] wn; // variates
array[N] int I; // index
int<lower=0> K; // number of domains
vector[2] c; // coefficients
vector[2] sigma0; // population noise
vector[2] mu_s0; // population mean
vector<lower=0>[2] sig_s0; // population variance
}
parameters {
real<lower=0> sigma; //noise
vector[K] s_k;
real mu_s;
real<lower=0> sig_s;
}
model {
// priors/hyperpriors
mu_s ~ gamma(mu_s0[1],mu_s0[2]);
sig_s ~ gamma(sig_s0[1],sig_s0[2]);
sigma ~ gamma(sigma0[1],sigma0[2]);
for (k in 1:K) {
s_k[k] ~ normal(mu_s, sig_s);
}
vector[N] mu;
for (n in 1:N) {
mu[n] = c[1]*s_k[I[n]] + c[2];
}
wn ~ normal(mu, sigma); // likelihood
}
A plot of the learned parameters is shown below, to which I have added the expected values calculated directly from the generated data with vertical lines.
There are a number of problems with this:
- the variance of the population mean seems too small given the data
- the mode of the population standard deviation is much too high compared to the expected value
- the noise is also too high
I feel like there must be a problem with my understanding/how I have formulated the Stan model, although I seem to get the same result in PyMC. I have also tried z-transforming the data but get the same result, as well as generating the data via Stan - again the same result.
Any ideas/advice would be greatly appreciated!