How to calculate log_lik in generated quantities of two different process(models) running simultaneously in a longitudinal data


Dear Stan community,

I need to use the loo package for model comparison. To do so, one needs to calculate log-lik in the generated quantity block. I’m simultaneously running two models here: Bernoulli and Normal. When I try to generate normal_lpmf, it gives me the error “Identifier ‘mu’ not in scope”. Not everything involved in the model comes directly from the data; aside from ‘x’ as a covariate, some functions are involved in generating other covariates like ‘E’. I wonder how to calculate log_lik for my model when random variables are involved, and can I generate log-likelihood even from the binary part, which involves longitudinal data? Could you please guide me a bit? Thanks a lot.
``

data{
int<lower=1> n; // number of observations
int<lower=1> n1; // number of observations x 2
vector[n1] y; // dep_data
int x_p;
int p; //nr of parameters
matrix[n1,x_p] x; //covariates
int PP1[n,n]; //network
int PP2[n,n]; //network
int PP3[n,n]; //network

int<lower=1> J; // dimension of observations
vector[J] Zero; // a vector of Zeros (fixed means of observations)
corr_matrix[J] I;
}

parameters {
vector<lower=0>[J] tau;
real<lower=0> sigmasq;
vector[p] beta;// add beta 2
matrix[n,J] ZZ; // latent-observations
vector[3] alpha;
}

transformed parameters{
cov_matrix[J] Tau= quad_form_diag(I, tau);
matrix[n, n] ZZ1 = dist(ZZ, n); //dist functions appply
}

model{
matrix[n,n] W = weight(ZZ1, n); // weight function applied
matrix [n1,n1] DW=double_rows_extend_columns(W); // double row function apply
vector [n1] E =DW x[,2];
matrix [n1,3] DD =append_col(x,E);
vector[n1] mu = DD
beta;

  for (i in 1:n){
    ZZ[i,] ~ multi_normal(Zero, Tau);
 }

sigmasq ~ student_t(4,0,5);//gamma(0.1, 0.1);
tau ~  student_t(4,0,5);   

beta ~ normal(0, 10); //regression parameters
alpha ~ normal(0, 10); //regression parameters

for (i in 1:(n-1)){
for (j in (i+1):(n-1)){
real mu2 = alpha[1] + alpha[2]*ZZ1[i,j]+ alpha[3]*PP1[i,j];
real mu3 = alpha[1] + alpha[2]*ZZ1[i,j]+ alpha[3]*PP2[i,j];

PP2[i,j] ~ bernoulli(Phi(mu2));
PP3[i,j] ~ bernoulli(Phi(mu3));
}
}

y ~ normal(mu, sigmasq);

}

generated quantities {
vector[n1] log_lik;
for (k in 1:n1)
log_lik[k] = normal_lpmf(y[k] | mu[k]*beta); // ???
}

****
Error  got: 
Error in stanc(filename, allow_undefined = TRUE) : 0
Semantic error in 'string', line 121, column 32 to column 34:
   -------------------------------------------------
   119:  vector[n1] log_lik;
   120:  for (k in 1:n1)
   121:  log_lik[k] = normal_lpmf(y[k] | mu[k]*beta);
                                         ^
   122:  }
   -------------------------------------------------

Identifier 'mu' not in scope. Did you mean 'I'?

Hi.

Parameters defined in the data, transformed data, parameters and transformed parameters block can be used in the generated quantities block, not local parameters defined in the model block. You would thus need to either define mu in the transformed parameters block or redefine it in the generated quantities block an do the calculations again.

Thank you, @Garren_Hermanus for your help.
But, could you please provide any insight into whether, while remaining within the same code, I can generate log-lik for the Bernoulli model as well, apart from the normal one? And if so, should it be included within the same ‘generated quantities’ block? My concern stems from the longitudinal nature of this process, which makes it somewhat complex to manage.

I am not familiar with the loo package at all, just provide generic help with stan. But if you would have to calculate log_lik their is several things you may consider of which the following is probably simplest:

If mu is a transformation of your existing parameters you can define it in the transformed parameters block and then you may use the log_lik as is in the generated quantities block. That is you can do all the calculations for mu in the generated quantities block. Sample code would be using the code you provided:

transformed parameters {
      cov_matrix[J] Tau= quad_form_diag(I, tau);
      matrix[n, n] ZZ1 = dist(ZZ, n); //dist functions appply
      vector[n1] mu;
      // define local variables for W, DW, W, DD as it is only used for mu calculations
       {
             matrix[n,n] W = weight(ZZ1, n); // weight function applied
             matrix [n1,n1] DW=double_rows_extend_columns(W); // double row function apply
             vector [n1] E =DW x[,2]; // will result in error, correct code
             matrix [n1,3] DD =append_col(x,E);
             mu = DDbeta; // will result in error, correct code
        }
}

Notice the above is with the incorrect code you provided, so you should fix your code for this.

And correct the log_lik statement in your code:

log_lik[k] = normal_lpmf(y[k] | mu[k]*beta, sigmasq); // added standard deviation. 

Also note that the normal distribution in stan takes in arguments mean and standard deviation, not variance. So sigmasq should probably be corrected for this.

  • it should be normal_lpdf instead of normal_lpmf (I did not even know it could run with that)
  • it seems you are using an old Stan version as syntax int PP1[n,n] is not supported anymore. The new syntax is array[n,n] int PP1
  • you can compute log_lik for the Bernoulli parts, too.
  • if you compute pointwise leave-one-out log scores (pointwise elpd_loo) for y, PP1, and PP2, you need to be careful in thinking how the different parts contribute to the total sum (elpd_loo). You could also examine the log scores separately for different targets. Or for the longitudinal data you could look at the joint log score for leave-one-time-point-out of leave-future-out.

Thank you @Garren_Hermanus , a lot for all advices.

Thank you @avehtari for the advice. Part of my concern, which still leaves me confused, is the method for calculating elpd_loo. My model is a joint model involving normal and Bernoulli distributions for two different processes connected by a covariate ZZ1. Now, finding the pointwise elpd_loo for y, PP1, and PP2, and then adding them together will give me the total elpd of the model? Any advice on papers or materials that would help in handling this situation in joint models would be highly appreciated… Thank you very much!

Yes.

CV-FAQ has a short answer. I’m not aware of any paper discussing this in detail. From theoretical and methodological point of view, it is quite trivial, and from applied point of view, it really depends on the modeling goals. If you tell more about your modeling goals, maybe I can help a bit more.

Thanks a lot @avehtari for your help. So the model I am applying is jointly running two processes:
1.One is a linear model: “y ~ normal(mu, sigmasq),” where "mu is a function of ZZ1 and other covariates.
2. The other one is a probit model: PP2[i,j] ~ bernoulli(Phi(mu2))/ PP3[i,j] ~ bernoulli(Phi(mu3)) which also involve ZZ1 as a covariate.
Both involve time.
Now, my question is if I calculate elpd of the first process using log_lik[k] = normal_lpdf(y[k] | mu[k]*beta); and to the second one in a similar way using a bernouly_lpmf invloving for loop, (because I do not see another way how to calculate) and add them, it would give me a correct way of calculating the elpd of an entire model since I have a variable ZZ1 that correlated those? And even if I am interested to find only epld of one process of " y ~ normal(mu, sigmasq) would it be acurate enough since this model is using information of ZZ1 from the other process? The reason I am doing this is because I need to find a way to compare or define predictive accuracy in this type of work.
Thank you!

If you tell about the model in abstract variable names (y, ZZ1, PP2, PP3) and ask a technical question how to technically compute some value, I can tell whether the computation is technically correct (yes), but I can’t tell whether that value is useful for you and whether some other technically correct computation would give you something more useful. What I meant when I proposed that you tell more about your modeling goals is that you would tell about the phenomenon you are modeling and what is the question you actually try to answer. I understand if your project is secret, and you don’t want to share more, but then I’m not able to add much to what I already said.

@avehtari, I appreciate your help here. I went back to read the paper related to the loo package [https://arxiv.org/pdf/1507.04544.pdf]. But when I try to calculate log_lik for every model involved, there’s a question about how to add them because there’s no way that I can generate log_lik and log_lik1 + log_lik2 separately, right? In the generated quantities block that I am adding here, is an error I get. Any advice on why this happened, please?
Thank you!

Error in stanc(filename, allow_undefined = TRUE) : 0
Semantic error in ‘string’, line 135, column 2 to column 56:

  • -------------------------------------------------*
  • 133: real mu3 = alpha[1]*ZZ1[i,j]+alpha[2]PP2[i,j];
  • 134: *
  • 135: PP2[i,j] ~ bernoulli(pow(cauchy_cdf(mu2,0,1),lambda));*
  •       ^*
    
  • 136: log_lik1[i,j]=bernuli_lpmf(PP2[i,j]|mu2[i,j] );*
  • 137: *
  • -------------------------------------------------*

Target can only be accessed in the model block or in definitions of functions with the suffix _lp.

///////////////////////////////////////////////////////////////////////////
generated quantities {
vector[n1] log_lik;
for (k in 1:n1)
log_lik[k] = normal_lpdf(y[k] | mu[k],sigma);// ???

array[n,n] int log_lik1;
array[n,n] int log_lik2;
for (i in 1:(n-1)){
for (j in (i+1):(n-1)){

real mu2 = alpha[1]*ZZ1[i,j]+alpha[2]*PP1[i,j];
real mu3 = alpha[1]*ZZ1[i,j]+alpha[2]*PP2[i,j];

PP2[i,j] ~ bernoulli(Phi(mu2));
log_lik1[i,j]=bernuli_lpmf(PP2[i,j]|mu2[i,j] );

PP3[i,j] ~ bernoulli(Phi(mu3));
log_lik2[i,j]=bernuli_lpmf(PP3[i,j]|mu3[i,j] );
}
}
// how to add??? log_lik1+log_lik2
/// how to generate log_lik and log_lik1+log_lik2
}