A Gaussian mixture model with censored data

In my example, suppose that there are some observations
y_1,y_2,\dots,y_n \sim \pi N(\mu_1,\sigma_1)+(1-\pi)N(\mu_2,\sigma_2).

However, in the real experiment, some $y_i$ are observed with right censoring at $y=K$.

Now I have the data y_i together with censoring indicators I_i, and I want to estimate the parameters \mu_1,\mu_2,\sigma_1,\sigma_2 (and \pi). Clearly, this is a normal mixture problem with censored data.

For a given y_i with censoring indicator I_i, where I_i=1 means right censoring,

y_i \sim \pi \, f(y_i|\mu_1,\sigma_1)^{1-I_i}\,(1-F(y_i|\mu_1,\sigma_1))^{I_i}+(1-\pi)\, f(y_i|\mu_2,\sigma_2)^{1-I_i}\,(1-F(y_i|\mu_2,\sigma_2))^{I_i}.

So, according to the density function above, here is my stan code:

User-defined functions
Here are some user-defined functions, defined so that
{fun}_1=f(y_i|\mu_1,\sigma_1)^{1-I_i}(1-F(y_i|\mu_1,\sigma_1))^{I_i},
{fun}_2=f(y_i|\mu_2,\sigma_2)^{1-I_i}(1-F(y_i|\mu_2,\sigma_2))^{I_i},
and
fun1\_log=\log(fun_1),\quad fun2\_log=\log(fun_2).
These three formulas define the density of the two-component mixture of right-censored normal distributions.

functions {
  real fun1_log(real y, real ind, real mu1, vector sigma) {
    return (1 - ind) * normal_lpdf(y | mu1, sigma[1])
           + ind * log(1 - normal_cdf(y, mu1, sigma[1]));
  }
  real fun2_log(real y, real ind, real mu2, vector sigma) {
    return (1 - ind) * normal_lpdf(y | mu2, sigma[2])
           + ind * log(1 - normal_cdf(y, mu2, sigma[2]));
  }
}


Parameters and Data

parameters {
  real mu1;
  real mu2;
  vector<lower=0>[2] sigma;
  real<lower=0, upper=1> p;
}
data {
  int<lower=0> N;
  vector[N] y;
  vector[N] index;
}


Model

model {
  mu1 ~ normal(50, 2);
  mu2 ~ normal(120, 4);
  sigma[1] ~ inv_gamma(0.01, 0.01);
  sigma[2] ~ inv_gamma(0.01, 0.01);
  p ~ beta(1, 1);
  for (n in 1:N)
    target += log_mix(p,
                      fun1_log(y[n], index[n] | mu1, sigma[1]),
                      fun2_log(y[n], index[n] | mu2, sigma[2]));
}


But when I compile the Stan code, an error appears:

PARSER EXPECTED: whitespace to end of file.
FOUND AT line 18:

Error in stanc(file = file, model_code = model_code, model_name = model_name,  :
failed to parse Stan model '95bdbfeb015fa8187a5778b30cd86959' due to the above error.


This model compiles. I didn’t try to check the math – just wiggled things around so they’d compile:

functions {
  real fun1_log(real y, real ind, real mu1, real sigma) {
    return (1 - ind) * normal_lpdf(y | mu1, sigma)
           + ind * log(1 - normal_cdf(y, mu1, sigma));
  }
  real fun2_log(real y, real ind, real mu2, real sigma) {
    return (1 - ind) * normal_lpdf(y | mu2, sigma)
           + ind * log(1 - normal_cdf(y, mu2, sigma));
  }
}

data {
  int<lower=0> N;
  vector[N] y;
  vector[N] index;
}

parameters {
  real mu1;
  real mu2;
  vector<lower=0>[2] sigma;
  real<lower=0, upper=1> p;
}

model {
  mu1 ~ normal(50, 2);
  mu2 ~ normal(120, 4);
  sigma[1] ~ inv_gamma(0.01, 0.01);
  sigma[2] ~ inv_gamma(0.01, 0.01);
  p ~ beta(1, 1);
  for (n in 1:N)
    target += log_mix(p,
                      fun1_log(y[n], index[n], mu1, sigma[1]),
                      fun2_log(y[n], index[n], mu2, sigma[2]));
}


The error you were getting was because the data block must be defined before the parameters block. That error message wasn’t very helpful, so thanks for reporting. I’ll make a GitHub issue about this.

Also, I got rid of the vertical bars in the fun1_log calls. I’m not 100% sure of the syntax rules for those, but the way you’re using these functions is just like normal functions, so they’re not needed.

I also made the last argument to both functions a real instead of a vector (since each is only being passed one sigma), and then updated the sigma references in the function calls as well.

Functions with _log suffix do not use vertical bars, that syntax is a quirk of _lpdf functions. Also the bars are kind of stupid because they always assume your distribution is univariate. You’d have to put the bar after the first argument even if you’re trying to define the joint density of two variables.
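For reference, here is a sketch of what the first function would look like with the _lpdf suffix (untested, and shown only to illustrate the bar placement; not part of the fix above):

functions {
  // Same log density as fun1_log, written as an _lpdf function.
  // With the _lpdf suffix, the bar separates the variate y from the
  // remaining arguments, so a call looks like fun1_lpdf(y | ind, mu1, sigma).
  real fun1_lpdf(real y, real ind, real mu1, real sigma) {
    return (1 - ind) * normal_lpdf(y | mu1, sigma)
           + ind * log(1 - normal_cdf(y, mu1, sigma));
  }
}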

You don’t need to wrap your densities in functions. You can just write the expression inside log_mix. Also I think it’s clearer if you use explicit control flow instead of that multiply-by-zero trick.

for (n in 1:N) {
  if (index[n] == 0) {
    target += log_mix(p, normal_lpdf(y[n] | mu1, sigma[1]),
                         normal_lpdf(y[n] | mu2, sigma[2]));
  } else {
    target += log_mix(p, normal_lccdf(y[n] | mu1, sigma[1]),
                         normal_lccdf(y[n] | mu2, sigma[2]));
  }
}


By the way, those inverse-gamma priors for sigma are pretty bad; you should consider something weakly informative instead.
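For example, since sigma already has a <lower=0> constraint, a half-normal prior is one common weakly informative choice (the scale 10 here is a placeholder; pick it relative to the spread you expect in y):

// With the <lower=0> constraint, these statements imply half-normal priors,
// which avoid the spike near zero that inv_gamma(0.01, 0.01) induces.
sigma[1] ~ normal(0, 10);
sigma[2] ~ normal(0, 10);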


Thanks for your answer. It does seem like a real improvement.
However, a new problem with initial values comes up, which I also saw before:

Chain 1: Rejecting initial value:
Chain 1:   Log probability evaluates to log(0), i.e. negative infinity.
Chain 1:   Stan can't start sampling from this initial value.
Chain 1: Initialization between (-2, 2) failed after 100 attempts.
Chain 1:  Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
[1] "Error in sampler\$call_sampler(args_list[[i]]) : Initialization failed."
error occurred during calling the sampler; sampling not done


Since I am new to Stan, I know little about debugging. How can I solve this? Thanks again!

Right, Stan always starts by computing the log likelihood of a guess where all parameters have values near zero. Your priors for mu are about 25 standard deviations away from that. It should work anyway, but those cumulative probability functions have a bug that makes them fail when the guess is too far off.
You can use the init argument when calling Stan to start at a different guess, but if you have the newest Stan you can just set the parameter offsets to their expected values.

parameters {
  real<offset=50> mu1;
  real<offset=120> mu2;
  ...
}


Yes, just after replying to you, I noticed that the init argument does influence the computation of the log likelihood. So I tried the init argument with different values. Some of them work, while most of them fail with log(0). As a result, I think a proper initial value is important!

Besides, I write code in the R environment, so the corresponding Stan interface is rstan. I will try adding offset=x to the parameters block.