NLME model in STAN for missing + unbalanced longitudinal data

I'm running into divergent transitions, R-hat warnings, and low ESS. Any tips or advice are much appreciated.

My thoughts on potential issues:

  1. Maybe it’s how I’m dealing with the unbalanced longitudinal data, i.e., the if (i != previous_id) check? I thought the logic makes sense here: I sample random effects whenever ‘ID’ in the dataset changes to a new value.
  2. Maybe it’s that I’m handling the model being a function of ‘age’ with two indices: i for the unique subject IDs and row_indx for the rows of my dataset. Is there a better way to do this (anything in the STAN manual for modeling subject-specific observed data that are functions of time)?

I attached a snippet of the data, the full code, and the full math:
[screenshot of data snippet]


data {
    int<lower=0> NJ; // dataset has NJ rows; NJ = $\sum_{i=1}^{N} j_{i}$ (each subject has a different number of observations)
    int<lower=0> N; // number of subjects

    array[NJ] int<lower=0> all_subject_ids;
    array[NJ] real<lower=0> ages;
    array[NJ] real<lower=0> y_obs;
    array[NJ] int<lower=0,upper=1> y_obs_bool; // 1 if y_obs is observed, 0 if missing
}

parameters {
    real<lower=0,upper=1> tau; // RV

    real<lower=log(10),upper=log(550)> log_bar_thetaA; // fixed effect
    real<lower=log(5), upper=log(20)> log_bar_thetaB; // fixed effect
    real<lower=log(40), upper=log(70)> log_bar_thetaC; // fixed effect

    real<lower=0, upper=1> omega_thetaA; // s.d. for between-subject variability (BSV) random effect
    real<lower=0, upper=1> omega_thetaB; // s.d. for BSV random effect
    real<lower=0, upper=1> omega_thetaC; // s.d. for BSV random effect
    
    vector[N] log_thetaA; // BSV random effect
    vector[N] log_thetaB; // BSV random effect
    vector[N] log_thetaC; //BSV random effect
}

model {
    int previous_id;
    previous_id = 0;
    int i;

    vector[NJ] y_pred;

    for (row_indx in 1:NJ) {
        if (y_obs_bool[row_indx]==1) { // baseline observations (at first age) are always observed
            i = all_subject_ids[row_indx]; // can repeat
            if (i!=previous_id) {
                log_thetaA[i] ~ normal(log_bar_thetaA, omega_thetaA);
                log_thetaB[i] ~ normal(log_bar_thetaB, omega_thetaB);
                log_thetaC[i] ~ normal(log_bar_thetaC, omega_thetaC);

                previous_id = i;
            }
            y_pred[row_indx] = exp(log_thetaA[i]) * (1 - ( ( 1 * ( ages[row_indx] ^ exp(log_thetaC[i]) ) ) / ( exp(log_thetaB[i]) ^ exp(log_thetaC[i]) + ages[row_indx] ^ exp(log_thetaC[i]) ) ) );

            target += normal_lpdf(y_obs[row_indx] | y_pred[row_indx], y_pred[row_indx] * tau);
        }
    }
}


Sigmoid \text{I}_\text{max} model:
y_{i}(\text{age})= \theta_{iA}\cdot \Bigg(1-\frac{1\cdot\text{age}_{i}^{\theta_{iC}}}{\theta_{iB}^{\theta_{iC}}+\text{age}_{i}^{\theta_{iC}}}\Bigg)\cdot(1+\epsilon_{i}), \ \ \ \epsilon_{i}\sim\text{N}(0,\tau^{2})
for j=A, B, C:
\ln(\theta_{ij})\sim\text{N}(\ln(\bar{\theta}_{j}),\omega_{j}^{2})
where \bar{\theta}_{j} is a fixed effect, \omega_{j} is the s.d. for the between-subject variability (BSV) random effect, and \tau is the s.d. for the residual variability (RV)

I didn’t look at this really carefully, but is there are reason that you can’t just do

log_thetaA[id[i]] ~ normal(log_bar_thetaA, omega_thetaA);
log_thetaB[id[i]] ~ normal(log_bar_thetaB, omega_thetaB);
log_thetaC[id[i]] ~ normal(log_bar_thetaC, omega_thetaC);

and avoid the if statement involving previous_id. That may not fix anything, but it might be worth a try.

Kent

1 Like

Thank you very much for the response.

Could you expand on what you mean by id[i]? Sorry in advance if this is a simple concept. I might have a vague idea of what you’re talking about, but please note that my data are unbalanced (e.g., 8 rows with ID 1, 6 rows with ID 2).

My thought process behind log_thetaA[i], log_thetaB[i], log_thetaC[i] was because I need subject-specific random effects.

Best,
Michael

I should have written ID[i]. That’s a variable in your data. ID[i] is the ID associated with the ith row of data. For example, ID[4] is 1. Unless I’m missing something, using this approach avoids the previous_id comparison in your code.

Kent

1 Like

Oh I see! Thank you kindly for the response and explanation. I should’ve figured out what you meant.

That definitely improved my code, but it’s still problematic (the same divergence, R-hat, and ESS issues, just with smaller numbers). Still, thank you again.

Best,
Michael

Updated based on kholsinger’s (Kent’s) suggestion:

data {
    int<lower=0> NJ; // dataset has NJ rows; NJ = $\sum_{i=1}^{N} j_{i}$ (each subject has a different number of observations)
    int<lower=0> N; // number of subjects

    array[NJ] int<lower=0> ID;
    array[NJ] real<lower=0> ages;
    array[NJ] real<lower=0> y_obs;
    array[NJ] int<lower=0,upper=1> y_obs_bool;
}

parameters {
    real<lower=0> tau; // RV

    real log_bar_thetaA; // fixed effect
    real log_bar_thetaB; // fixed effect
    real log_bar_thetaC; // fixed effect

    real<lower=0> omega_thetaA; // s.d. for between-subject variability (BSV) random effect
    real<lower=0> omega_thetaB; // s.d. for BSV random effect
    real<lower=0> omega_thetaC; // s.d. for BSV random effect
    
    vector[N] log_thetaA; // BSV random effect
    vector[N] log_thetaB; // BSV random effect
    vector[N] log_thetaC; //BSV random effect
}

model {
    vector[NJ] y_pred;

    for (i in 1:NJ) { // looping over rows of the dataset; note 'i' here indexes rows of data, not the unique subject IDs 1:N
        if (y_obs_bool[i]==1) { // note that baseline observations (at first age) are always observed
            log_thetaA[ID[i]] ~ normal(log_bar_thetaA, omega_thetaA);
            log_thetaB[ID[i]] ~ normal(log_bar_thetaB, omega_thetaB);
            log_thetaC[ID[i]] ~ normal(log_bar_thetaC, omega_thetaC);

            y_pred[i] = exp(log_thetaA[ID[i]]) * (1 - ( ( 1 * ( ages[i] ^ exp(log_thetaC[ID[i]]) ) ) / ( exp(log_thetaB[ID[i]]) ^ exp(log_thetaC[ID[i]]) + ages[i] ^ exp(log_thetaC[ID[i]]) ) ) );

            target += normal_lpdf(y_obs[i] | y_pred[i], y_pred[i] * tau);
        }
    }
}

I would recommend against using interval priors unless things have physically constrained bounds. For instance, it looks like you’re constraining standard deviations between 0 and 1 and effects between lower and upper log-scale values. One thing you can check to see if this is a problem is whether probability is bunching up against the boundaries; that will cause divergences, etc., as it drives values on the unconstrained scale to plus or minus infinity.

For priors on your random effects, you should just be looping over N to make sure all of the variables get priors, e.g.,

log_thetaA ~ normal(log_bar_thetaA, omega_thetaA);

This produces a centered parameterization, so to get a non-centered parameterization you need to go back and declare the variables with an offset and multiplier, e.g.,

vector<offset=log_bar_thetaA, multiplier=omega_thetaA>[N] log_thetaA;

Then loop through the row_indx values.
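
Putting the two pieces together, here is a minimal sketch (using the same names as above) of the offset/multiplier declaration plus the vectorized prior:

parameters {
    real log_bar_thetaA;
    real<lower=0> omega_thetaA;
    // offset/multiplier rescales the unconstrained space, giving a
    // non-centered parameterization under the hood
    vector<offset=log_bar_thetaA, multiplier=omega_thetaA>[N] log_thetaA;
}
model {
    // one vectorized statement puts the prior on all N elements
    log_thetaA ~ normal(log_bar_thetaA, omega_thetaA);
}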

I was curious about why you’re multiplying y_pred[row_indx] * tau to get the scale of the observations.

You may also be running into numerical instabilities here (and you never need to multiply by 1!):

y_pred[row_indx] 
= exp(log_thetaA[i]) 
* (1 - ages[row_indx]^exp(log_thetaC[i])
       / (exp(log_thetaB[i])^exp(log_thetaC[i]) 
          + ages[row_indx]^exp(log_thetaC[i])));

I would check that the right-hand side here isn’t overflowing or underflowing. In particular, taking exp(a)^exp(b) is going to overflow or underflow whenever the absolute value of b is greater than 6, unless exp(a) is very close to 1.
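
To see why, note that exp(a)^exp(b) = exp(a * exp(b)), and a double-precision exp() overflows once its argument passes roughly 710, so with |b| > 6 (i.e., exp(b) > 403) even a modest |a| blows up. Here is a sketch of one way to keep the whole fraction on the log scale (my suggestion using the same variable names, not code from the thread):

// sketch: compute the sigmoid fraction on the log scale to avoid
// overflow in exp(a)^exp(b) = exp(a * exp(b)); assumes ages[row_indx] > 0
real c = exp(log_thetaC[i]);                     // theta_C
real log_num = c * log(ages[row_indx]);          // log(age ^ theta_C)
real log_denom = log_sum_exp(c * log_thetaB[i],  // log(theta_B ^ theta_C
                             log_num);           //     + age ^ theta_C)
y_pred[row_indx] = exp(log_thetaA[i]) * -expm1(log_num - log_denom);
// -expm1(x) = 1 - exp(x), computed stably for x near 0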

The observations y_obs are constrained to be positive, so you actually have to include the bounds for the normal_lpdf to get things to normalize, because the arguments are not constant. That is, you want something like:

y_obs[row_indx] ~ normal(y_pred[row_indx], y_pred[row_indx] * tau) T[0, ];

where the T provides the truncation below at 0. Otherwise, the normal’s not going to be normalized with y_obs constrained to be positive.
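
For reference, the truncated sampling statement above is equivalent to incrementing the target with the log density minus the log of the probability mass above zero:

target += normal_lpdf(y_obs[row_indx] | y_pred[row_indx], y_pred[row_indx] * tau)
          - normal_lccdf(0 | y_pred[row_indx], y_pred[row_indx] * tau);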

2 Likes

Hi Bob,

Thank you so much for the wonderful advice and tips. I am already seeing huge improvements after implementing your suggestions to the best of my abilities (only 31/10000 divergent transitions and no other warnings; I ran 10,000 iterations). Q1: Would this be “okay” as far as warnings go? Below is the updated code:

data {
    int<lower=0> NJ; // dataset has NJ rows; NJ = $\sum_{i=1}^{N} j_{i}$ (each subject has a different number of observations)
    int<lower=0> N; // number of subjects

    array[NJ] int<lower=0> ID;
    array[NJ] real<lower=0> ages;
    array[NJ] real<lower=0> y_obs;
    array[NJ] int<lower=0,upper=1> y_obs_bool;
}

parameters {
    real<lower=0> tau; // s.d. for residual variability (RV)

    real log_bar_thetaA; // fixed effect
    real log_bar_thetaB; // fixed effect
    real log_bar_thetaC; // fixed effect

    real<lower=0> omega_thetaA; // s.d. for between-subject variability (BSV) random effect
    real<lower=0> omega_thetaB; // s.d. for BSV random effect
    real<lower=0> omega_thetaC; // s.d. for BSV random effect
    
    vector<offset=log_bar_thetaA, multiplier=omega_thetaA> [N] log_thetaA; // BSV random effect
    vector<offset=log_bar_thetaB, multiplier=omega_thetaB> [N] log_thetaB; // BSV random effect
    vector<offset=log_bar_thetaC, multiplier=omega_thetaC> [N] log_thetaC; // BSV random effect
}

model {
    vector[NJ] y_pred;

    for (i in 1:N) { // going through the unique subjects
        log_thetaA[i] ~ normal(log_bar_thetaA, omega_thetaA);
        log_thetaB[i] ~ normal(log_bar_thetaB, omega_thetaB);
        log_thetaC[i] ~ normal(log_bar_thetaC, omega_thetaC);

        if (abs(log_thetaC[i]) > 6) {
            print("exp(a)^exp(b) possible overflow/underflow: |b|=", abs(log_thetaC[i]), "; exp(a)=", exp(log_thetaB[i]), " =1ish (hopefully)");
        }
    }

    for (row_indx in 1:NJ) { // going through rows of dataset
        if (y_obs_bool[row_indx]==1) { // note that baseline observations (at first age) are always observed
            y_pred[row_indx] = exp(log_thetaA[ID[row_indx]]) * (1 - ages[row_indx] ^ exp(log_thetaC[ID[row_indx]]) / (exp(log_thetaB[ID[row_indx]]) ^ exp(log_thetaC[ID[row_indx]]) + ages[row_indx] ^ exp(log_thetaC[ID[row_indx]])));
            
            y_obs[row_indx] ~ normal(y_pred[row_indx], y_pred[row_indx]*tau) T[0, ];
        }
    }
}

Q2: Possible overflow issues?
output_and_print.txt (31.6 KB)
Do the print statements before Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup) matter? I only get a single print statement at Chain 1: Iteration: 3500 / 10000 [ 35%] (Sampling) and another at Chain 1: Iteration: 4500 / 10000 [ 45%] (Sampling).

“I was curious about why you’re multiplying y_pred[row_indx] * tau to get the scale of the observations.”

Based on my statistical understanding of the model: y_{i}(\text{age})\mid\theta_{iA}, \theta_{iB}, \theta_{iC} \sim \text{N}(y_{i,\text{pred}}(\text{age}),\ y_{i,\text{pred}}(\text{age})\cdot\tau), where y_{i,\text{pred}}(\text{age}) = \theta_{iA}\cdot \Bigg(1-\frac{1\cdot\text{age}_{i}^{\theta_{iC}}}{\theta_{iB}^{\theta_{iC}}+\text{age}_{i}^{\theta_{iC}}}\Bigg). Q3: I wasn’t sure if it was entirely necessary to write in STAN (and if necessary, how to specifically write) the conditioning on the random effects \theta_{iA}, \theta_{iB}, \theta_{iC}.

Thank you again for the great fixes,
Michael

Things to do (in general):

  1. Sample from the prior. Do you have divergences? Then you have an issue with the model specification. Check the samples from the prior: do they make sense? Once you have this working, go to the next step.
  2. Make synthetic data and make sure you can recover the generative parameters with good ESS and R-hat and no divergences. If you can get good fits without warnings from synthetic data, then move on to fitting your real data.
  3. If you still have warnings on your real data, this may indicate that some (or all) of your subjects are not using the strategy that you expect.
1 Like

Hi jerlich,

Thank you for the great suggestions! I will try all of this very soon.

Update: Chains aren’t mixing well with 3-4 chains (2 chains somehow work fine). I’m trying some different things at the moment. I will definitely look into my prior samples per jerlich’s advice.

We just talked about this in the Stan meeting today. It’s hard to know how much bias is getting introduced by those divergent transitions. One way to check how much bias is being introduced is with simulation-based calibration checks, but that’s expensive. I’ll reply to @jerlich’s related suggestions below.

I wouldn’t worry about these warnings during warmup. Often you’re in the tail of the distribution where numerics are less stable during warmup and then things settle down during sampling.

When you get this kind of warning during sampling, it’s just a specific kind of divergence.

I’m also not sure why you’re multiplying by 1 in the model; that doesn’t do anything, and the written model also has a 1 \cdot, so I’m not sure what the intent of the notation is.

OK—that makes sense. I wasn’t parsing the whole model. Is this intended to be something like a normal approximation to a Poisson? In that case, you usually want variance to be equal to mean, so that’d be normal(y, sqrt(y)) for the direct approximation.

This looks like you’re trying to get multiplicatively scaled errors. Your variable is constrained to be positive, in which case a lognormal distribution might make more sense. There the errors are naturally multiplicative (the additive errors get exponentiated).
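
If you were to try that swap, the likelihood line might look like this (a sketch of the lognormal alternative, not the paper’s model):

// lognormal keeps y_obs positive with multiplicative error;
// here tau plays the role of an s.d. on the log scale
y_obs[row_indx] ~ lognormal(log(y_pred[row_indx]), tau);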

I think this got garbled during transmission.

1 Like

You can usually sample a model using purely forward simulation with RNGs rather than with HMC. Then you won’t get any divergences unless values underflow/overflow.
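
For instance, here is a minimal sketch of forward simulation for this model, to be run with the fixed-parameter sampler (the prior scales below are placeholders I picked for illustration, not values from the thread):

data {
    int<lower=0> NJ;                     // total number of rows
    int<lower=0> N;                      // number of subjects
    array[NJ] int<lower=1, upper=N> ID;
    array[NJ] real<lower=0> ages;
}
generated quantities {
    // hyperparameters drawn from placeholder priors -- substitute your own
    real log_bar_thetaA = normal_rng(4, 1);
    real log_bar_thetaB = normal_rng(log(10), 0.5);
    real log_bar_thetaC = normal_rng(log(2), 0.5);
    real omega_thetaA = lognormal_rng(log(0.2), 1);
    real omega_thetaB = lognormal_rng(log(0.2), 1);
    real omega_thetaC = lognormal_rng(log(0.2), 1);
    real tau = lognormal_rng(log(0.1), 1);
    vector[N] log_thetaA;
    vector[N] log_thetaB;
    vector[N] log_thetaC;
    array[NJ] real y_sim;
    for (i in 1:N) {
        log_thetaA[i] = normal_rng(log_bar_thetaA, omega_thetaA);
        log_thetaB[i] = normal_rng(log_bar_thetaB, omega_thetaB);
        log_thetaC[i] = normal_rng(log_bar_thetaC, omega_thetaC);
    }
    for (r in 1:NJ) {
        real c = exp(log_thetaC[ID[r]]);
        real frac = exp(c * log(ages[r])
                        - log_sum_exp(c * log_thetaB[ID[r]], c * log(ages[r])));
        real mu = exp(log_thetaA[ID[r]]) * (1 - frac);
        y_sim[r] = mu * (1 + normal_rng(0, tau)); // no truncation in this sketch
    }
}

Fitting the simulated y_sim back through your sampling model is then exactly the parameter-recovery check described above.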

This is tricky. I know we’ve talked about prior predictive checks as part of the workflow, and I think they can be useful for avoiding extreme blunders. But we don’t really require simulations from the prior to make sense if the prior’s only weakly informative. The only way a prior predictive check is going to “pass” in the sense of a posterior predictive check is when the prior looks a lot like the posterior. We’re still thinking about how to balance this.

It’s important here to generate the data from the model. I see a lot of failure here with people generating with a separate process.

The sense of recovery you want is posterior interval coverage: 50% of the values should be covered by their 50% posterior intervals. More precisely, we expect the number out of N elements that fall within their 50% posterior intervals to be distributed as binomial(N, 0.5) over multiple runs.

Although it’s safest to do so if you can, you don’t necessarily have to eliminate all the divergences to pass SBC checks. In fact, using SBC is the way to make sure that things like divergences are not causing major bias in estimation. We’re always going to get error from sampling, and the bias we get from a couple divergences may be dwarfed by sampling variance. Simulation based calibration checks can find this.

Right. There are two places where divergences will arise. One is with misspecified models that don’t match the data well. That can cause a problem for sampling as it will induce bad geometry trying to balance mismatched prior/likelihood and data. The second is when there are just problems sampling because HMC isn’t great at every target density. HMC struggles with too much posterior correlation (poorly identified models) and with highly varying curvature (like the funnel in a centered parameterization). Sometimes this can be fixed by reparameterizing the model—it’s not always due to misspecification.

2 Likes

Thank you so much for the detailed responses. To answer your questions:

I’m also not sure why you’re multiplying by 1 in the model—that doesn’t do anything, but then it has a 1 *., so I’m not sure what the intent of the notation is.

This is my bad. The “1” multiplier was originally supposed to be a fixed effect, but for empirical reasons the paper (I’m implementing the paper’s existing model) set that fixed effect to 1. I originally left the “1” in as a reminder/placeholder for the fixed-effect term, not knowing it might cause numerical issues. I’ve deleted it in my up-to-date code.

I wasn’t parsing the whole model. Is this intended to be something like a normal approximation to a Poisson? … This looks like you’re trying to get multiplicatively scaled errors. It looks like your variable is constrained to be positive, in which case, a lognormal distribution might make more sense.

I will definitely try these out (Poisson and lognormal), but unfortunately, in the long run, I think I’m constrained within my project to implement the already-existing model from a paper (TBD). I’m not sure whether their intention was a normal approximation to the Poisson. An interesting side note: I ran into fewer issues after removing the T[0, ] bound for the normal_lpdf of y_obs (across multiple runs with different seeds).

“Q3: I wasn’t sure if it was entirely necessary to write in STAN (and if necessary, how to specifically write the following): conditional on the random effects…”
=> “I think this got garbled during transmission.”

Sorry for my confusing wording. All I meant to ask is whether I have to specify “conditional on the priors” in STAN for my normal distribution on y_obs. I’m guessing the answer is probably not.

Just to be clear, multiplying by 1 should not introduce numerical issues. Just some inefficiency in propagating derivatives through one more chain-rule step.

If anything on the right-hand side involves a parameter, you need the T[0, ] to get the right result under the truncated distribution assumption you’re making. Otherwise, your distribution isn’t properly normalized for y_obs.

I’m not sure what that means. If the model for y_obs involves parameters and those parameters have priors, then everything works together jointly. And you definitely don’t need the T[0, ] for a data outcome whose location and scale involve no parameters, since the truncation term is then constant and drops out of the target.

P.S. It’s “Stan” because it’s named after Stanislaw Ulam, rather than being an acronym.

1 Like

Thanks for fleshing out my tl;dr answer :)

One point about sampling from the prior: in our experience it is very hard to reason about hierarchical priors. We fit hierarchical models (estimating effects from a within-subjects pharmacological experiment), and inspecting the prior samples made us realize we needed much tighter priors at the subject level than we naively expected. Once we tightened those priors, we eliminated divergences and improved recovery of the generative parameters from synthetic data.

I should maybe add that, in our case, we probably should have specified the priors not one by one for each parameter but as a joint distribution.

1 Like

Thank you again Bob. Everything duly noted :) I really do appreciate you pointing out “Stan”. Sorry it took me so long to respond; I wanted to try many things with my Stan code and didn’t want to post something erroneous. I actually found the same results as jerlich, I believe:

  1. No parameter bounds, i.e., no interval priors: I get good results, but only for specific seeds (all four chains mix well, but only for specific seeds). For bad results, the traceplots show some chains staying around the true value while other chains get stuck somewhere else.
  2. Putting some “large” (?) bounds (true value ± 5*SE(true value)) on my parameters: my traceplots stay more consistently “good” regardless of seed, with a “very small” number of divergence warnings. There are very infrequent exceptions where one chain just gets stuck at a value across the parameters.

Still not 100% confirmed, just results from a few trial runs.

You can’t trust a model that only works for specific seeds. The concern is that you’re getting bias from not exploring all the posterior.

Did you mean standard deviation here rather than standard error (SE)? You don’t want bounds around SE in the model as that’s just a consequence of how long you run inference.

When you say putting bounds, are they hard interval constraints, or soft in that you have something like normal(expected value, 5 * sd)?

The only thing to do here is play with alternative parameterizations. About the only guidance here is intuition and pairs plots of posterior draws of parameters.

1 Like

Did you mean standard deviation here rather than standard error (SE)? You don’t want bounds around SE in the model as that’s just a consequence of how long you run inference.

The idea was to put big bounds on my parameters, like a giant confidence interval (true value ± 5*SE), but it definitely wouldn’t hurt to use the SD instead (just making the bounds even bigger).

When you say putting bounds, are they hard interval constraints or soft in that you have something like normal(expected value, 5sd)?

Based on my understanding, I’d have to classify what I have as hard constraints on my hyperparameters, but it’s probably better if I show an example (I should’ve done this earlier, my bad):

parameters {
   real<lower=-7, upper=15> log_bar_thetaA;
   real<lower=0, upper=0.9> omega_thetaA;
   vector<offset=log_bar_thetaA, multiplier=omega_thetaA> [N] log_thetaA;
}

model {
   log_thetaA ~ normal(log_bar_thetaA, omega_thetaA);
}

The only thing to do here is play with alternative parameterizations. About the only guidance here is intuition and pairs plots of posterior draws of parameters.

I will do this, thank you again Bob.

SE is an estimate of the square root of the variance of the posterior-mean estimate for a parameter theta[n], whereas SD is the intrinsic posterior standard deviation. Technically, SE = SD / sqrt(ESS), where ESS is the effective sample size. ESS grows linearly with the length of the chains used.
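
For example (my arithmetic): with a posterior SD of 2 and an ESS of 400, \text{SE} = 2/\sqrt{400} = 0.1, so quadrupling the ESS halves the SE.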

Yes, that’s what I meant by hard constraints—there are hard lower and upper bounds beyond which the parameter will not be allowed to stray. I’d be inclined to use softer bounds like this, for which your intervals are roughly the 95% central intervals—you can make that more extreme to 99% or whatever. It usually leads to smoother computation.

real log_bar_thetaA;
real<lower=0> omega_thetaA;
...
log_bar_thetaA ~ normal(4, 5.5);
omega_thetaA ~ lognormal(log(0.2), 1);
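
As a quick check (my arithmetic, not from the thread): \text{normal}(4, 5.5) puts its central 95% interval at 4 \pm 1.96 \cdot 5.5 \approx (-6.8, 14.8), roughly matching the hard bounds (-7, 15) above, while \text{lognormal}(\log(0.2), 1) has median 0.2 and central 95% interval about (0.2e^{-1.96}, 0.2e^{1.96}) \approx (0.03, 1.4).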
1 Like