Random sample/subset from vector per iteration

I’m trying to do predictions for unseen subjects (with varying intercepts) in the generated quantities block.

I’d like to try taking a random subset of varying intercepts for the observed subjects to use for the unseen subjects.

I thing this is what “uncertainty” does or refers to in this post.

In R, I would use sample(observed_intercepts, n_new_subjects, replace = TRUE)

Is there a way to accomplish this within a Stan program. I saw mention of using categorical_rng, but I’m unsure how to implement that function and specify the theta simplex.

I’ve estimated intercepts for 30 observed subjects from which to draw samples.

Would something like:

vector[n_new_subjects] Subj_Idx = categorical_rng(rep_vector(1/n_observed_subjects), n_observed_subjects)

Do the trick?

For uniform distribution you would set theta = rep_vector(1/J, J) where J is the number of observed intercepts. Each draw returns a single random int, so you loop n_new_subject times.

But are you sure you don’t want to sample the new intercepts from the distribution you inferred in your random intercept model? (e.g. a normal distribution with mean \mu_\alpha and standard deviation \sigma_\alpha). That is the usual approach to predictions for new subjects, though maybe I’m missing something about your problem, e.g. your new subjects are actually a random sample of your old subjects.

I missed the link to the post I referenced:

I have implemented what you suggest (I think referred to to as “Gaussian” in the above post). I think sampling from the observed intercepts is what’s referred to as “uncertainty”.

Did a link get broken here? The Stan User’s Guide has this description of the posterior predictive distributions and how to compute them:

https://mc-stan.org/docs/stan-users-guide/posterior-prediction.html

What @potash is bringing up is that this is unusual. Usually we do posterior predictive inference with the model we fit.

Close—that needs an int in the declaration. I would strongly discourage using n_observed_subjects and n_obsrved_subejcts as variables if that wasn’t a typo.

If you really want to do sampling the way you say, you can do this, where I’ve used different variables for the above distinction (how many available, how many to select).

int J;  // number of intercepts to use per prediction
vector[K] alpha = ...;  // varying intercepts
array[K] int<lower=0, upper=J> selected = multinomial_rng(rep_vector(1.0 / K), J);
real linear_predictor = to_vector(selected) .* varying_intercepts;

It will basically implement something like “drop out” from ML. Stan won’t really let you do this in the model block for training, though. But beware that this allows each varying intercept to selected zero times, once, or more than once.

Now I don’t know if you want to scale it back to where it was by multiplying the linear predictor by K / J. If you set K = J, then you get the same number and the ratio is 1.

Here’s how I’ve attempted to implement both for comparison. I’m not sure I have the indexing correct:

generated quantities {
    
        // Observed
        
        vector[N] mu = X * b; // Population mean
        vector[N] epred = ((X * b) * sd(Y)) + mean(Y) ; // Expectations
        vector[N] pred = to_vector(normal_rng(mu + r_subj[Subj], sigma)) .* sd(Y) + mean(Y) ; // Predictions
        vector[N] log_lik ; // Log-likelihood per observation
        
        // Log-likelihood
        
        for (i in 1:N){
                log_lik[i] = normal_id_glm_lpdf(y_scaled[i] | [X[i]], r_subj[Subj[i]], b, sigma) ;
        }
        
        // Unobserved
        
        // Linear predictor
        
        vector[N_New] mu_new = X_New * b; // New data population mean
        vector[N_New] epred_new = ((X_New * b) * sd(Y)) + mean(Y) ; // New data expectations
        
        // Predictions
        
        // Uncertainty
        
        array[nSubj_New] int<lower=1> Subj_Idx;

        for (i in 1:nSubj_New){
                Subj_Idx[i] = categorical_rng(rep_vector(1.0/nSubj, nSubj)) ;
        }

        vector[N_New] pred_uncert = to_vector(normal_rng(mu_new + r_subj[Subj_Idx[Subj_New]], sigma)) .* sd(Y) + mean(Y) ; // New data predictions
        
        // "Gaussian"        
        
        vector[N_New] pred_gauss = to_vector(normal_rng(mu_new + to_vector(normal_rng(rep_vector(0, nSubj_New), subj_sd))[Subj_New], sigma)) .* sd(Y) + mean(Y) ; // New data predictions
       
}

Line breaks make this easier to scan on screen. :-). In general, defining intermediate quantities can make things a lot easier to read as they provide in-code documentation for what sub-expressions mean.

I see that you’re randomly indexing into the subjects r_subj with duplication and then adding that to mu_new. Does that mean the ordering of mu_new doesn’t matter?

This is an unusual way to try to do predictive inference. Why not just use standard Bayesian posterior predictive inference? Specifically, why are you randomizing the predictors across dimensions of Why are you randomizing predictors?

What’s going on in that final line with a normal_rng nested within a normal_rng? Is that just mimicking your data generating process? Labeling intermediate quantities here can help with doc.

Using a loop for pred_guass is going to be faster in generated quantities than building up with to_vector as there’s no autodiff structure to trim down with vectorization. The problem with the vector-based operations is that they tend to allocate and fill new containers, which puts memory pressure on the processor, which is often the main bottleneck for statistical code.

I also didn’t understand why there’s a prediction and then prediction uncertainty. Usually we just roll the predictive uncertainty into our posterior predictive inference as follows:

\displaystyle p(\widetilde{y} \mid y) = \int_\Theta p(\widetilde{y} \mid \theta) \cdot p(\theta \mid y) \, \textrm{d}\theta.

The uncertainty in estimating \theta given y is represented by the posterior, and sampling uncertainty in the model itself is represented by p(\widetilde{y} \mid \theta). With MCMC, we compute as

\displaystyle p(\widetilde{y} \mid y) = \frac{1}{M} \sum_{m=1}^M p^\textrm{rng}_{Y \mid \Theta}(\widetilde{y} \mid \theta^{(m)}) where \theta^{(m)} \sim p(\theta \mid y) is a posterior draw and p_{Y\mid\Theta}^{\textrm{rng}} is a random number generator for the sampling distribution, which we write in full as p_{Y \mid \Theta}(y \mid \theta).

What it looks like your’e doing is changing the distribution p_{Y \mid \Theta} randomly rather than just generating from it.

My intention is to use the model for sample size expectation/estimation.

What I’m attempting is to:

  • fit a model to some pilot data
  • use the fitted model to make predictions for a larger sample of unseen subjects
  • Fit that model hundreds of times
  • evaluate parameter distributions (credible interval width)

Increase number of unobserved subjects

  • repeat

I’m trying to reproduce something akin to allow_new_levels in the linked post.

With intermediate steps, I think it would look like this:

generated quantities {
    
        // Observed
        
        vector[N] mu = X * b; // Population mean
        vector[N] epred = ((X * b) * sd(Y)) + mean(Y) ; // Expectations
        vector[N_New] pred = to_vector(normal_rng(mu + r_subj[Subj], sigma)) .* sd(Y) + mean(Y) ; // Predictions
        vector[N] log_lik ; // Log-likelihood per observation
        
        // Log-likelihood
        
        for (i in 1:N){
                log_lik[i] = normal_id_glm_lpdf(y_scaled[i] | [X[i]], r_subj[Subj[i]], b, sigma) ;
        }
        
        // Unobserved
        
        // Linear predictor
        
        vector[N_New] mu_new = X_New * b; // New data population mean
        vector[N_New] epred_new = ((X_New * b) * sd(Y)) + mean(Y) ; // New data expectations
        
        // Predictions
        
        // Uncertainty
        
        array[nSubj_New] int<lower=1> Subj_Idx;

        for (i in 1:nSubj_New){
                Subj_Idx[i] = categorical_rng(rep_vector(1.0/nSubj, nSubj)) ;
        }
         
        vector[N_New] pred_uncert = to_vector(normal_rng(mu_new + r_subj[Subj_Idx[Subj_New]], sigma)) .* sd(Y) + mean(Y) ; // New data predictions
        
        // "Gaussian"
        
      vector[nSubj_New] r_subj_new;
        
        for (i in 1:nSubj_New){
                r_subj_new[i] = normal_rng(0, subj_sd) ;
        }         
        
        vector[N_New] pred_gauss = to_vector(normal_rng(mu_new + r_subj_new[Subj_New], sigma)) .* sd(Y) + mean(Y) ; // New data predictions
       
}


I should probably opt out of this question as I think I’m just muddying the waters. I can’t make the connection between the linked post, where @paul.buerkner is sampling new levels in a multilevel model (it’s like we have respondents groups into batches for clinical trials, so we can either predict a new member of an existing batch or a whole new batch). That can be used to do this:

The unseen subjects are classified as either being new members of an existing batch or members of a new batch. But I can’t connect that @JLC’s code, which looks like it’s randomizing group effects. Is that because you’re trying to sample a random new person from an existing group and there’s only a single effect to randomize?

Yes, I am trying to randomly sample an estimated parameter (subject-level intercept) and use it for an unseen subject. That’s how I’m interpreting “uncertainty” vs “gaussian”.

I am trying to simulate additional new people to existing pilot data.

I’m afraid I don’t know what that means, so I’ll let other people respond. Sorry for all the confusion!

I’m attempting to implement the default behaviour of an interface in the ‘generated quantities’ block of a Stan model.

My question might be double-barrelled in asking if my understanding, and implementation, of “uncertainty” is accurate.

I think you are implementing the different methods correctly.

To clarify more generally, brms has different way to simulate/predict new levels in multilevel models. I would recommend you using Gaussian because it is the conceptually easiest and does not require using samples from the old levels. That said, the sample_new_levels = "uncertainty" indeed, for each posterior sample, takes values from a random chosen existing group. This way, the distribution of the group-level coefficients can be approximated without using a Gaussian assumption.

Thanks for the input, @paul.buerkner !