Hi,
I have a model for semi-continuous positive data, where the output can be exactly zero or a positive real value. I got great advice from this post.
For my prior predictive check, I use lognormal_rng to generate samples with the code below. The problem is that lognormal_rng returns negative values for y_pred (predictions over the test data), which is strange. I tried forcing the location of the lognormal (its mean on the log scale) to be strictly non-negative, because I saw that it is sometimes negative (Stan is silent about that and does not throw an error). So why the negative values? And what are the best practices to avoid the overflow problem that leads to them?
I'd really appreciate it if someone could help me debug this. Thanks! Here is the relevant code:
data {
int<lower=1> N; // number of observations
int<lower=1> N_pred;
int<lower=1> I; // number of predictors
matrix[N_pred,I] X_test; // matrix of predictors
}
transformed data {
vector[N_pred] ones_N_pred = rep_vector(1, N_pred);
matrix[N_pred,I+1] X_test_intercept = append_col(ones_N_pred, X_test);
}
parameters {
real mu;
vector[I+1] beta_coeff_zi; // intercept and predictor coefficients
real<lower=0> sigma; // scale parameter
vector<lower=-mu>[I+1] beta_coeff_raw; // implies lower=0 on beta_coeff so the lognormal location is non-negative
}
transformed parameters {
vector[I+1] beta_coeff = mu + beta_coeff_raw;
}
model {
beta_coeff_raw ~ normal(0, 1);
beta_coeff_zi ~ normal(0, 1);
sigma ~ exponential(1);
}
generated quantities {
vector[N_pred] y_pred = to_vector(lognormal_rng(X_test_intercept * beta_coeff, sigma));
vector[N_pred] labels = to_vector(bernoulli_rng(inv_cloglog(X_test_intercept * beta_coeff_zi)));
y_pred = y_pred .* labels;
}
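To double-check the generative step outside Stan, here is a minimal NumPy sketch of the same prior predictive draw (the predictors and coefficients are placeholder random values, not the actual data or fitted parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

N_pred, I = 100, 3
X_test = rng.normal(size=(N_pred, I))          # placeholder predictor matrix
X = np.hstack([np.ones((N_pred, 1)), X_test])  # append intercept column

beta = rng.normal(size=I + 1)                  # stand-in for beta_coeff
beta_zi = rng.normal(size=I + 1)               # stand-in for beta_coeff_zi
sigma = rng.exponential(1.0)                   # stand-in for sigma

# lognormal part: X @ beta is the location on the log scale
y_pred = rng.lognormal(mean=X @ beta, sigma=sigma)

# hurdle part: inv_cloglog(x) = 1 - exp(-exp(x)) gives the zero/nonzero labels
p = 1.0 - np.exp(-np.exp(X @ beta_zi))
labels = rng.binomial(1, p)

y_pred = y_pred * labels  # zero out the draws with label 0
```

Every nonzero draw here should be strictly positive, regardless of the sign of the location.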
Here is the histogram of y_pred[y_pred != 0], which clearly shows many negative values from lognormal_rng:
I also tried using normal_rng instead of lognormal_rng and then applying y_pred[y_pred != 0] = exp(y_pred[y_pred != 0]), but I run into the overflow issue there as well.
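As an illustration of the overflow path (my guess at what happens, not verified inside Stan): in double precision, exp overflows to infinity once its argument exceeds roughly 709.78, so a large draw on the log scale cannot produce a finite positive value:

```python
import numpy as np

# double-precision exp overflows once the argument exceeds ~709.78
with np.errstate(over="ignore"):
    big = np.exp(np.float64(710.0))  # overflows to inf, not a negative number
    ok = np.exp(np.float64(709.0))   # still finite
```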
PS: Here is the input to the Stan code as a DataFrame.