 Problematic Sampling with Model

Hi all,

I am trying to fit a dynamic latent variable model with 2 latent variables that evolve over time. But I’m having some trouble getting the sampler to work (the trace plots look almost flat for several chains). Specifically, the sampler warns me that the E-BFMI is low (e.g. 0.00535) and that I may need to reparameterize my model. But I’ve already tried the reparametization based on the stochastic volatility model (i.e. https://mc-stan.org/docs/2_19/stan-users-guide/stochastic-volatility-models.html) and I’m simply not sure how to proceed from here.

I’ve also imposed sign constraints because I observed that there was sign indeterminacy for some of the parameters while trying different variants of the model.

The idea is that the second latent variable (i.e. F) should decay as a function of time while the first latent variable should describe the temporal evolution of the variables T_t, R_t, I_t . R_t is actually reaction time and its coded as -1 when there is no reaction. I_t is the indicator for reaction/no reaction.

Is the poor sampling behavior due to unidentifiability? Or is it because the two latent share some descendants in the graphical model (like how cross-loadings often lead to estimation issues in factor analysis)?

Would greatly appreciate any insights into this issue!

data {
int<lower=0>                                  N;
int<lower=1>                                  numBlocks;
int<lower=1>                                  num_timepts;
int<lower=0,upper=1>                          I[numBlocks, N];
matrix<lower=-1>[numBlocks, N]                R;
matrix<lower=0,upper=1>[numBlocks, N]         V;
matrix<lower=0,upper=100>[numBlocks, N]       T;
vector[N]                                     G;
int<lower=0,upper=1>                          E[numBlocks, num_timepts, N];
int<lower=0,upper=1>                          S[numBlocks, num_timepts, N];
matrix[N,2]                                   Grp;
matrix<lower=0>[numBlocks, N]                 Z;
}

transformed data {
int     num_S0[numBlocks, N];
int     num_S1[numBlocks, N];
int     num_E_S0[numBlocks, N];
int     num_E_S1[numBlocks, N];

matrix<lower=0>[numBlocks, N]  I_saw_event;

for (blk_i in 1:numBlocks) {
for (per_i in 1:N) {

num_S0[blk_i, per_i] = 0;
num_S1[blk_i, per_i] = 0;
num_E_S0[blk_i, per_i] = 0;
num_E_S1[blk_i, per_i] = 0;

for (time_i in 1:num_timepts) {

num_S0[blk_i, per_i] += (1 - S[blk_i, time_i, per_i]);
if ( S[blk_i, time_i, per_i] == 0 ) {
num_E_S0[blk_i, per_i] += E[blk_i, time_i, per_i];
} else {
num_E_S1[blk_i, per_i] += E[blk_i, time_i, per_i];
}

}

num_S1[blk_i, per_i] = num_timepts - num_S0[blk_i, per_i];

}
}

I_saw_event = 1.0 - to_matrix(I);

}

parameters {
row_vector[N]                       B1_std;
row_vector[N]                       F1_std;

real             b_B_Bt_slope;
real             b_B_Tt_slope_raw;      // Constrained to be positive and set upper bound
real             b_B_Rt_slope_raw;      // Constrained to be positive
real             b_V_Bt_slope_raw;      // Constrained to be positive
real             b_V_It_slope;
real             b_V_It_int;
real            b_F_It_slope_raw;   // Constrained to be negative
real            b_F_Et_slope_raw;   // Constrained positive
real            F_int;
real            b_Grp_Ft_slope_1;
real            b_Grp_Ft_slope_2_raw; // Constrain to be negative
real            b_Z_Bt_slope_raw;  // Constrain to be positive

real<lower=0>  sigma_Rt;
real<lower=0, upper=2>  sigma_Tt;       // Impose upper bound
real<lower=0> sigma_G;

real F_G_slope;

}

transformed parameters {
matrix[numBlocks+1, N]      B;
matrix[numBlocks, N]      F;

real<lower=5,upper=100>     b_B_Tt_slope;
real<lower=0>     b_B_Rt_slope;
real<lower=0>     b_V_Bt_slope;
real<upper=0>     b_F_It_slope;
vector<upper=0>[N]     F_decay;
real<upper=0>     b_Grp_Ft_slope_2;
row_vector        state_coefs;
real<lower=0> b_Z_Bt_slope;
real<lower=0> b_F_Et_slope;

b_B_Tt_slope = 95 * inv_logit(b_B_Tt_slope_raw) + 5;
b_B_Rt_slope = exp(b_B_Rt_slope_raw);
b_V_Bt_slope = exp(b_V_Bt_slope_raw);
b_Z_Bt_slope = exp(b_Z_Bt_slope_raw);
b_Grp_Ft_slope_2 = -exp(b_Grp_Ft_slope_2_raw);

b_F_It_slope = -exp(b_F_It_slope_raw);
b_F_Et_slope = exp(b_F_Et_slope_raw);

for (per_i in 1:N) {
F_decay[per_i] = -exp(b_Grp_Ft_slope_1 * (Grp')'[per_i] + b_Grp_Ft_slope_2 * (Grp')'[per_i] + F_int);
}

// Initialize
B = B1_std;
F = exp(F1_std);

// State equation
state_coefs = [b_V_Bt_slope, b_B_Bt_slope, b_Z_Bt_slope];
for (t in 2:(numBlocks+1)) {
B[t] = state_coefs * [V[t-1] .* I_saw_event[t-1], B[t-1], Z[t-1]];
if (t <= numBlocks) {
for (per_i in 1:N) {
F[t, per_i] = pow(F[1, per_i], (t-1) * F_decay[per_i]);
}
}
}

}

model {

// Priors for coefficients
F_int ~ normal(0,5);
b_Grp_Ft_slope_1 ~ normal(0,5);
b_Grp_Ft_slope_2 ~ normal(0,5);
F_G_slope ~ normal(0,5);
b_B_Bt_slope ~ normal(0,5);
b_V_It_slope ~ normal(0,5);
b_V_It_int ~ normal(0,5);
b_F_Et_slope_raw ~ normal(0,1);
b_F_It_slope_raw ~ normal(0,1);
sigma_Rt ~ normal(0,1);
sigma_Tt ~ normal(0,1);
b_B_Tt_slope_raw ~ normal(0,1);
b_B_Rt_slope_raw ~ normal(0,1);
b_V_Bt_slope_raw ~ normal(0,1);
b_Z_Bt_slope ~ normal(0,5);
sigma_G ~ normal(0,1);
F_G_slope ~ normal(0,5);

// Priors for latent variables
B1_std ~ std_normal();
F1_std ~ std_normal();

// Measurement model
for ( k in 1:numBlocks ) {

// Measurement for I
I[k] ~ bernoulli_logit( b_V_It_int + b_V_It_slope * V[k] + b_F_It_slope * F[k] );

// Measurement for R
for ( rt_n in 1:N ) {
if ( I[k, rt_n] == 0 ) {
R[k, rt_n] ~ lognormal( b_B_Rt_slope * B[k, rt_n] , sigma_Rt );
} else {
R[k, rt_n] ~ normal( -1, 0.001 );
}
}

// Measurement for T
T[k] ~ normal( b_B_Tt_slope * B[k+1], sigma_Tt );

// Measurement for E
for (e_n in 1:N) {
if ( num_S0[k, e_n] != 0 ) {
num_E_S0[k, e_n] ~ binomial_logit( num_S0[k, e_n], b_F_Et_slope * F[k, e_n] );
}

if ( num_S1[k, e_n] != 0 ) {
num_E_S1[k, e_n] ~ binomial_logit( num_S1[k, e_n], b_F_Et_slope * F[k, e_n] );
}
}

}

// Measurement for G
G ~ normal( F_G_slope * F' * [1./8., 1./8., 1./8., 1./8., 1./8., 1./8., 1./8., 1./8.]', sigma_G );

}  Maybe @bbbales2 could help?

This model looks pretty complicated. Is there a simpler model that also reproduces the problem? You want to build these things in pieces. Lots of stuff can go wrong in a complex model. And we want to straighten out if it’s just a bug in the specification or a numerical problem and these are easier to catch with simpler models.

By the way, use the Stan builtin stuff to handle the constraints: https://mc-stan.org/docs/2_20/reference-manual/univariate-data-types-and-variable-declarations.html

For instance just declare b_B_Rt_slope as:

real<lower = 0.0> b_B_Rt_slope;

You probably got some warnings about jacobians for statements like b_B_Bt_slope ~ normal(0,5);. Those will go away if you use the constrained syntax. More on why here but feel free to ask (it’s a little weird): https://mc-stan.org/docs/2_20/reference-manual/variable-transforms-chapter.html

Thanks for responding! The phenomena that we’re trying to model is a little complicated so we went ahead with the complex model first. But it seems like we have to build up the model from simpler ones to diagnose the problem… I’ll have to try running a few simpler versions and see how it goes. I’ll ask on the forums again when I run into problems with the simpler models.

Actually I’m just curious: is it really a bad idea to use expressions like

b_B_Rt_slope = exp(b_B_Rt_slope_raw);

to impose positive constraints? When I used Stan’s builtin constraint declarations the sampler will often tell me that some iterations are rejected because they are invalid. Intuitively, it also seems like the sampler will work better exploring the unconstrained space? Or am I mistaken? (Edit: Oh my bad. I just saw in the link that Stan is already sampling in the unconstrained space by taking the exact same transformation).

Thanks for catching that error! I fixed it already.

If you’re doing the transformation manually, there’s also a Jacobian adjustment you would need to make it equivalent to what Stan is doing with the constraints.

Since you’re putting priors on the transformed parameters, you’d want that Jacobian adjustment though (otherwise it’s unclear what the priors mean). Check out sections 10.1 and 10.2 here and see if they make sense: https://mc-stan.org/docs/2_20/reference-manual/change-of-variables-section.html

2 Likes

Actually read the intro to that as well: https://mc-stan.org/docs/2_20/reference-manual/variable-transforms-chapter.html

2 Likes

So I’ve tried running two simpler models (its slightly different from the model in my previous post because new data for the second block just came in):

Essentially, I’m trying to introduce a latent variable T_t that evolves across time. T_t is measured by the observed variable S_t and it is updated across time as a function of the observed variables (Y_t, C_t) in the first block and as a function of (Y_t, C_t, D_t) in the second block (see the transformed parameters section called “State Equation”).

For this model, I ran 6 chains with the HMC sampler with adapt_delta = 0.99, max_treedepth = 15. The sampler seems to work fine (mixing is good) for 5 of the chains. However, the sampler is telling me that all the samples from one of the chains is hitting the max_treedepth. I’m now running the sampler again with max_treedepth = 20. Is that the correct approach or is it an indication that the model is pathological? Since most of the chains converged to sampling the same space, can I actually drop the problematic chain when doing further analyses?

Also, I’ve tried running an even simpler model where I removed the variable D_t. Specifically, I’ve changed the following two lines:

// Transformed parameters block
T2[t,n] = to_array_1d(Tt2_vec' + l_rate_nobrake * ( Z[C2[t-1,n]] ) * D2[t-1,n] / D_total_2[t-1, n] * Y2_error[t-1,n]);
// Model block
O2_dv[k,n] ~ bernoulli( (1.0 * D2[k,n] / D_total_2[k,n]) * Phi_approx( to_row_vector(T2[k,n]) * Z[C2[k,n]]' ) );

to the following:

// Transformed parameters block
T2[t,n] = to_array_1d(Tt2_vec' + l_rate_nobrake * ( Z[C2[t-1,n]] ) * Y2_error[t-1,n]);
// Model block
O2_dv[k,n] ~ bernoulli( Phi_approx( to_row_vector(T2[k,n]) * Z[C2[k,n]]' ) );

where D_t was removed. Surprisingly, the sampler started behaving very strangely (e.g. low E-BFMI for 3 of the chains, 30% divergent transitions, hitting max_treedepth) despite the small change. I can’t seem to wrap my head around this phenomenon, but perhaps you would have more insights into this issue?

I’ve attached a sample of the traceplots below. The first one is from the model with D_t and the second is from the model without.

Would greatly appreciate any inputs!

Stan code can be found below:

data {
// Data for first block
int<lower=0>                          N;                        // Number of subjects
int<lower=1>                          numblocks;                // Number of time blocks for FIRST PART
int<lower=0>                          E[numblocks, N];          // Eye-tracking for non-critical event
int<lower=0>                          E_total[numblocks, N];    // Total number of time points for non-critical event
int<lower=0>                          D[numblocks, N];          // Eye-tracking for critical event
int<lower=0>                          D_total[numblocks, N];    // Total nnumber of time points for critical event
real<lower=0,upper=100>               S[numblocks, N];          // Observed trust rating
int<lower=-1,upper=1>                 B[numblocks, N];          // Indicator for noticing the event
real<lower=0>                         R[numblocks, N];          // Reaction time
int<lower=-1,upper=1>                 Y[numblocks, N];          // Outcome of event
int<lower=1,upper=4>                  C[numblocks, N];          // Type of event
matrix[3,N]                           W;                        // Experimental Group

// Data for second block
int<lower=0>                          numblocks2;                 // Number of time blocks for SECOND PART
int<lower=0>                          E2[numblocks2, N];          // Eye-tracking for non-critical event
int<lower=0>                          E_total_2[numblocks2, N];    // Total number of time points for non-critical event
int<lower=0>                          D2[numblocks2, N];          // Eye-tracking for critical event
int<lower=0>                          D_total_2[numblocks2, N];    // Total nnumber of time points for critical event
real<lower=0,upper=100>               S2[numblocks2, N];          // Observed trust rating
int<lower=-1,upper=1>                 O2[numblocks2, N];          // Indicator for overtake
real<lower=0>                         R2[numblocks2, N];          // Reaction time
int<lower=-1,upper=1>                 Y2[numblocks2, N];          // Outcome of event
int<lower=1,upper=4>                  C2[numblocks2, N];          // Type of event
}

transformed data {
// Indices to transform event types
int<lower=1, upper=4> idxs[3, 2]
= { {2, 1},
{3, 1},
{4, 1} };

// Transform +1, -1 binaries to allow for bernoulli
int<lower=0,upper=1>        B_dv[numblocks, N];
int<lower=0,upper=1>        Y_dv[numblocks, N];
int<lower=0,upper=1>        O2_dv[numblocks2, N];
int<lower=0,upper=1>        Y2_dv[numblocks2, N];

// Transform to 0, 1
// This is stupid but I can't seem to find out how to add constant matrix
// Brute force (only once)
for (n in 1:N) {
for (t in 1:numblocks) {
B_dv[t, n] = (B[t, n] + 1) / 2;
Y_dv[t, n] = (Y[t, n] + 1) / 2;
}

for (k in 1:numblocks2) {
O2_dv[k, n] = (O2[k, n] + 1) / 2;
Y2_dv[k, n] = (Y2[k, n] + 1) / 2;
}
}
}

parameters {
// Hyperparameters
vector                 mu_T1;

// Latent encoding for C
vector                 Z_raw;

// Individual Initial Parameters
matrix[1,N]               T1_std;

// Other Parameters
vector<lower=0>        T_S_slope;
real                      O_Y_slope;
real                      O_Y_int;
real<lower=0>             l_rate_saw;
real<lower=0>             l_rate_brake;
real<lower=0>             l_rate_nobrake;

// Error variances
real<lower=0,upper=5>     sigma_S;
}

transformed parameters {
// Latent trust
real                      T[numblocks+1, N, 2];
real                      T2[numblocks2+1, N, 2];

// Prediction error
matrix[numblocks, N]      Y_error;
matrix[numblocks2, N]     Y2_error;

// Declare temporary matrices / vectors
matrix[1,N]                  T1_tmp;
vector                    Tt_vec;
vector                    Tt2_vec;

// Latent encoding
matrix[4,1]               Z;
for (i in 1:3) {
Z[idxs[i, 1], idxs[i, 2]] = Z_raw[i];
}
Z[1,1] = 0;

// Define Initial Trust / Attention
for (k in 1:1) {
T1_tmp[k] = T1_std[k] + mu_T1[k];
}
T = to_array_2d(T1_tmp');

// State Equation (First Phase)
for (t in 2:(numblocks+1)) {
for (n in 1:N) {
Tt_vec = to_vector(T[t-1,n]);
Y_error[t-1,n] = Y[t-1,n] - Z[C[t-1,n]] * Tt_vec;
if (B[t-1,n] == 1) {  // Update if they see the event
T[t,n] = to_array_1d(Tt_vec' + l_rate_saw * ( Z[C[t-1,n]] ) * Y_error[t-1,n]);
} else {
T[t,n] = T[t-1,n];
}
}
}

// Connect first and second phase
T2 = T[numblocks+1];

// Second Phase
for (t in 2:(numblocks2+1)) {
for (n in 1:N) {
Tt2_vec = to_vector(T2[t-1,n]);
Y2_error[t-1,n] = Y2[t-1,n] - Z[C2[t-1,n]] * Tt2_vec;
if (O2[t-1,n] == -1) {  // Did not takeover, weigh by probability of observing event
T2[t,n] = to_array_1d(Tt2_vec' + l_rate_nobrake * ( Z[C2[t-1,n]] ) * D2[t-1,n] / D_total_2[t-1, n] * Y2_error[t-1,n]);
} else { // takeover
T2[t,n] = to_array_1d(Tt2_vec' + l_rate_brake * ( Z[C2[t-1,n]] ) * Y2_error[t-1,n]);
}
}
}

}

model {

// Prior for Hyperparameters
mu_T1 ~ normal(0,1);

// Prior for latent encoding
Z_raw ~ normal(0,1);

// Prior for individual trust
to_vector(T1_std) ~ normal(0,1);

// Prior for other Parameters
T_S_slope       ~ normal(0,10);
O_Y_slope       ~ normal(0,10);
O_Y_int         ~ normal(0,10);
l_rate_saw      ~ normal(0,10);         // Automatically half-normal
l_rate_nobrake  ~ normal(0,10);         // Automatically half-normal

// Prior for variances
sigma_S         ~ normal(0,1);

// Likelihood for first phase
for (k in 1:numblocks) {
S[k] ~ normal( to_matrix(T[k+1]) * T_S_slope, sigma_S );
}

// Likelihood for the second phase
for (k in 1:numblocks2) {
S2[k] ~ normal( to_matrix(T2[k+1]) * T_S_slope, sigma_S );
Y2_dv[k] ~ bernoulli_logit( O_Y_int + O_Y_slope * to_vector(O2[k]) );
for (n in 1:N) {
O2_dv[k,n] ~ bernoulli( (1.0 * D2[k,n] / D_total_2[k,n]) * Phi_approx( to_row_vector(T2[k,n]) * Z[C2[k,n]]' ) );
}
}

}

max_treedepth is on a log2 scale, so saying max_treedepth = 15 means that internally Stan is putting up to 2^5 (32x) extra work into drawing each sample (it’s solving an ODE internally and integrating it up to 32x as long). The chain that is hitting max_treedepth means that you’re always putting in this extra work.

adapt_delta = 0.99 will make Stan lower the timestep of the ODE a lot.

The behavior you’re describing – not all chains agreeing, huge treedepths, and high adapt_delta, means this model is giving Stan a really hard time. It means there’s good reason to believe the samples you’re getting back are not representative of the posterior. I think that’s the right way to say it. So the numerics (Stan) are giving you results probably not totally representative of the probabilistic model (the likelihood and prior you wrote down).

It sounds like the simplifications you tried didn’t work. Any chance you can simplify further? Like maybe 1/2 the lines of code :D? What you have still seems like quite a bit.

Once you’ve simplified it as much as possible, switch to simulated data. Make up values for all the parameters, generate some data in R or whatever, and go about fitting it.

Start by fixing most of the parameters to the true value used to generate the data. Relax this incrementally and see where things break. I don’t think time series problems with latent values are especially easy, for what it’s worth.

2 Likes