# Fitting a latent variable model

Hi, I have a censored data and I am trying to fit a hierarchical model with latent variable(Z). Here is the model
Z_{ij} \sim Bernoulli( P_z ).
\tilde{Y}_{ij} \sim Normal(\theta_i, \sigma).
Y_{ij} = M if Z_{ij} = 1 or \tilde{Y}_{ij} > M.
Y_{ij} = \tilde{Y}_{ij} if Z = 0 or \tilde{Y}_{ij} \leq M.
The model consider two types of censoring. Any observation has certain chance (P_z) to be censored. The other type of censoring is when the value gets too large.

When fitting the model I put a beta prior on P_z; a Gamma prior on \sigma and a normal prior on \theta_i's.

Here is the code where I used to generate the data.

i_size = 3
j_num = 500
sample_size = i_size * j_num
p_z = 0.15
M = 40
sigma = 1

np.random.seed(1)
z_list = np.random.binomial(n=1, p=p_z, size=sample_size)
theta_list = np.random.uniform(10, 40, size=i_size)
y_ct_list = np.random.normal(loc=np.repeat(theta_list, j_num), scale=sigma, size=sample_size)
y_list = np.repeat(np.nan, sample_size)

observed_boolean_list = (z_list == 0) & (y_ct_list <= M)
y_list[observed_boolean_list] = y_ct_list[observed_boolean_list]
x_array = np.repeat(np.arange(i_size), j_num)
x_mat = np.zeros((x_array.size, x_array.max()+1))
x_mat[np.arange(x_array.size), x_array] = 1


When I tried to fit the model I described above I tried to fit the equivalent mixture model instead.

data {
int<lower=0> N_cens;
int<lower=0> N_obs;
int<lower=1> i_size;
real M;
vector<lower=0>[N_obs] y_obs;
matrix[N_obs, i_size] x_obs;
matrix[N_cens, i_size] x_cens;
}
parameters {
real<lower=0, upper=1> p_z_cens;
real<lower=0> sigma;
vector<lower=0>[i_size] theta;
}
model {
sigma ~ gamma(0.2, 0.1);
theta ~ normal(20, 7);
p_z_cens ~ beta(2, 5);
for (i in 1:N_obs) {
target += log(1 - p_z_cens) + normal_lpdf(y_obs[i] | x_obs[i] * theta, sigma);
}
for (i in 1:N_cens) {
target += log_mix(p_z_cens , 0, normal_lccdf(M | row(x_cens, i) * theta, sigma));
//target += log_sum_exp(log(p_z_cens),log(1 - p_z_cens) + normal_lccdf(censored_ct | x_cens[i] * theta, sigma));
}
}


To explain what happens in the model block, when the data is not censored the likelihood f(y_{ij}) is
(1 - P_z) * normal\_pdf(y_{ij}). When the data is censored the likelihood is P(Y_{ij} = M) = P_z * 1 + (1 - P_z) * (1 - Normal\_cdf(M)).

Here is how I code the model_data.

model_data = {"N_cens": int(sum(np.isnan(y_list))), "N_obs": int(sum(~np.isnan(y_list))), "i_size": i_size,
"M": M, "y_obs": y_list[~np.isnan(y_list)],
"x_obs": x_mat[~np.isnan(y_list), :],
"x_cens": x_mat[np.isnan(y_list), :]}


There are several problems I encounter.

1. The warning message says â€śThe parameter p_z_cens has 2 priors.â€ť which I donâ€™t see why it is the case.
2. The trace plots of \sigma and P_z donâ€™t appear to be random when the number of replicates for each I(j_size) gets large. For example, one trace plot I got is
3. The posterior distributions are not concentrating more and more around the true value as the sample size gets large. For \sigma, itâ€™s completely off.

If anyone can tell me what I did wrong (ie. I didnâ€™t use the correct likelihood or error in Stan code), I really appreciate it. Thanks!

Hi and welcome. Sorry for the long wait. This type of model is not my area but this post will bump your question back to the top.

Thanks. I was able to fit the model with pytwalk. However, I still have no clue how to fit with Stan.

Could you post your model call as well? Unless I missed it.

I built a class as a wrapper to use the Stan package. Here is the essence of the class.

class StanModel:
def __init__(self, model_code, model_data):
self.model_code = model_code
self.model_data = model_data
self.posterior_dt = None

def fit_posterior(self, random_seed=1, num_chains=1, num_samples=10000, **kwargs):
posterior = stan.build(self.model_code, data=self.model_data, random_seed=random_seed)
fit = posterior.sample(num_chains=num_chains, num_samples=num_samples, **kwargs)
self.posterior_dt = fit.to_frame()


Here is how I called it.

bm_instance = bm.StanModel(model_code=model_code, model_data=model_data)
bm_instance.fit_posterior(random_seed=1, num_chains=1, num_samples=mcmc_trials)

1 Like

Just to start with things I know that are sticking out to me.

num_chains = 1. Unless you are debugging sme chain errors I would set this to 4.

I would also turn off the random_seed for right now.

How many cores are you running?

I didnâ€™t set core number so it should the default number of the sample method. The reason I set num_chains = 1 is because I am not sure how to produce trace plots with multiple chains. As far as I know, I should plot trace from all chains separately. However right now, fit.to_frame() lumps MCMC samples from all chains into a single data frame and I cannot know which chain each sample belongs to. I am wondering if these MCMC samples in each column are ordered as [sampleS from chain 1, sampleS from chain 2 â€¦ sampleS from chain 4]. Itâ€™s not in the documentation (to the best of my knowledge) and there doesnâ€™t seem to be a built-in method for trace plots.