Random sample/subset from vector per iteration

JLC · October 14, 2024, 2:44pm

I’m trying to do predictions for unseen subjects (with varying intercepts) in the generated quantities block.

I’d like to try taking a random subset of varying intercepts for the observed subjects to use for the unseen subjects.

I thing this is what “uncertainty” does or refers to in this post.

In R, I would use sample(observed_intercepts, n_new_subjects, replace = TRUE)

Is there a way to accomplish this within a Stan program. I saw mention of using categorical_rng, but I’m unsure how to implement that function and specify the theta simplex.

I’ve estimated intercepts for 30 observed subjects from which to draw samples.

Would something like:

vector[n_new_subjects] Subj_Idx = categorical_rng(rep_vector(1/n_observed_subjects), n_observed_subjects)

Do the trick?

potash · October 14, 2024, 3:29pm

For uniform distribution you would set theta = rep_vector(1/J, J) where J is the number of observed intercepts. Each draw returns a single random int, so you loop n_new_subject times.

But are you sure you don’t want to sample the new intercepts from the distribution you inferred in your random intercept model? (e.g. a normal distribution with mean \mu_\alpha and standard deviation \sigma_\alpha). That is the usual approach to predictions for new subjects, though maybe I’m missing something about your problem, e.g. your new subjects are actually a random sample of your old subjects.

JLC · October 14, 2024, 4:19pm

I missed the link to the post I referenced:

I have implemented what you suggest (I think referred to to as “Gaussian” in the above post). I think sampling from the observed intercepts is what’s referred to as “uncertainty”.

Bob_Carpenter · October 14, 2024, 4:21pm

Did a link get broken here? The Stan User’s Guide has this description of the posterior predictive distributions and how to compute them:

https://mc-stan.org/docs/stan-users-guide/posterior-prediction.html

What @potash is bringing up is that this is unusual. Usually we do posterior predictive inference with the model we fit.

Close—that needs an int in the declaration. I would strongly discourage using n_observed_subjects and n_obsrved_subejcts as variables if that wasn’t a typo.

If you really want to do sampling the way you say, you can do this, where I’ve used different variables for the above distinction (how many available, how many to select).

int J;  // number of intercepts to use per prediction
vector[K] alpha = ...;  // varying intercepts
array[K] int<lower=0, upper=J> selected = multinomial_rng(rep_vector(1.0 / K), J);
real linear_predictor = to_vector(selected) .* varying_intercepts;

It will basically implement something like “drop out” from ML. Stan won’t really let you do this in the model block for training, though. But beware that this allows each varying intercept to selected zero times, once, or more than once.

Now I don’t know if you want to scale it back to where it was by multiplying the linear predictor by K / J. If you set K = J, then you get the same number and the ratio is 1.

JLC · October 14, 2024, 5:36pm

Here’s how I’ve attempted to implement both for comparison. I’m not sure I have the indexing correct:

generated quantities {
    
        // Observed
        
        vector[N] mu = X * b; // Population mean
        vector[N] epred = ((X * b) * sd(Y)) + mean(Y) ; // Expectations
        vector[N] pred = to_vector(normal_rng(mu + r_subj[Subj], sigma)) .* sd(Y) + mean(Y) ; // Predictions
        vector[N] log_lik ; // Log-likelihood per observation
        
        // Log-likelihood
        
        for (i in 1:N){
                log_lik[i] = normal_id_glm_lpdf(y_scaled[i] | [X[i]], r_subj[Subj[i]], b, sigma) ;
        }
        
        // Unobserved
        
        // Linear predictor
        
        vector[N_New] mu_new = X_New * b; // New data population mean
        vector[N_New] epred_new = ((X_New * b) * sd(Y)) + mean(Y) ; // New data expectations
        
        // Predictions
        
        // Uncertainty
        
        array[nSubj_New] int<lower=1> Subj_Idx;

        for (i in 1:nSubj_New){
                Subj_Idx[i] = categorical_rng(rep_vector(1.0/nSubj, nSubj)) ;
        }

        vector[N_New] pred_uncert = to_vector(normal_rng(mu_new + r_subj[Subj_Idx[Subj_New]], sigma)) .* sd(Y) + mean(Y) ; // New data predictions
        
        // "Gaussian"        
        
        vector[N_New] pred_gauss = to_vector(normal_rng(mu_new + to_vector(normal_rng(rep_vector(0, nSubj_New), subj_sd))[Subj_New], sigma)) .* sd(Y) + mean(Y) ; // New data predictions
       
}

Bob_Carpenter · October 14, 2024, 6:33pm

Line breaks make this easier to scan on screen. :-). In general, defining intermediate quantities can make things a lot easier to read as they provide in-code documentation for what sub-expressions mean.

I see that you’re randomly indexing into the subjects r_subj with duplication and then adding that to mu_new. Does that mean the ordering of mu_new doesn’t matter?

This is an unusual way to try to do predictive inference. Why not just use standard Bayesian posterior predictive inference? Specifically, why are you randomizing the predictors across dimensions of Why are you randomizing predictors?

What’s going on in that final line with a normal_rng nested within a normal_rng? Is that just mimicking your data generating process? Labeling intermediate quantities here can help with doc.

Using a loop for pred_guass is going to be faster in generated quantities than building up with to_vector as there’s no autodiff structure to trim down with vectorization. The problem with the vector-based operations is that they tend to allocate and fill new containers, which puts memory pressure on the processor, which is often the main bottleneck for statistical code.

I also didn’t understand why there’s a prediction and then prediction uncertainty. Usually we just roll the predictive uncertainty into our posterior predictive inference as follows:

\displaystyle p(\widetilde{y} \mid y) = \int_\Theta p(\widetilde{y} \mid \theta) \cdot p(\theta \mid y) \, \textrm{d}\theta.

The uncertainty in estimating \theta given y is represented by the posterior, and sampling uncertainty in the model itself is represented by p(\widetilde{y} \mid \theta). With MCMC, we compute as

\displaystyle p(\widetilde{y} \mid y) = \frac{1}{M} \sum_{m=1}^M p^\textrm{rng}_{Y \mid \Theta}(\widetilde{y} \mid \theta^{(m)}) where \theta^{(m)} \sim p(\theta \mid y) is a posterior draw and p_{Y\mid\Theta}^{\textrm{rng}} is a random number generator for the sampling distribution, which we write in full as p_{Y \mid \Theta}(y \mid \theta).

What it looks like your’e doing is changing the distribution p_{Y \mid \Theta} randomly rather than just generating from it.

JLC · October 14, 2024, 6:48pm

My intention is to use the model for sample size expectation/estimation.

What I’m attempting is to:

fit a model to some pilot data
use the fitted model to make predictions for a larger sample of unseen subjects
Fit that model hundreds of times
evaluate parameter distributions (credible interval width)

Increase number of unobserved subjects

repeat

I’m trying to reproduce something akin to allow_new_levels in the linked post.

With intermediate steps, I think it would look like this:

generated quantities {
    
        // Observed
        
        vector[N] mu = X * b; // Population mean
        vector[N] epred = ((X * b) * sd(Y)) + mean(Y) ; // Expectations
        vector[N_New] pred = to_vector(normal_rng(mu + r_subj[Subj], sigma)) .* sd(Y) + mean(Y) ; // Predictions
        vector[N] log_lik ; // Log-likelihood per observation
        
        // Log-likelihood
        
        for (i in 1:N){
                log_lik[i] = normal_id_glm_lpdf(y_scaled[i] | [X[i]], r_subj[Subj[i]], b, sigma) ;
        }
        
        // Unobserved
        
        // Linear predictor
        
        vector[N_New] mu_new = X_New * b; // New data population mean
        vector[N_New] epred_new = ((X_New * b) * sd(Y)) + mean(Y) ; // New data expectations
        
        // Predictions
        
        // Uncertainty
        
        array[nSubj_New] int<lower=1> Subj_Idx;

        for (i in 1:nSubj_New){
                Subj_Idx[i] = categorical_rng(rep_vector(1.0/nSubj, nSubj)) ;
        }
         
        vector[N_New] pred_uncert = to_vector(normal_rng(mu_new + r_subj[Subj_Idx[Subj_New]], sigma)) .* sd(Y) + mean(Y) ; // New data predictions
        
        // "Gaussian"
        
      vector[nSubj_New] r_subj_new;
        
        for (i in 1:nSubj_New){
                r_subj_new[i] = normal_rng(0, subj_sd) ;
        }         
        
        vector[N_New] pred_gauss = to_vector(normal_rng(mu_new + r_subj_new[Subj_New], sigma)) .* sd(Y) + mean(Y) ; // New data predictions
       
}

Bob_Carpenter · October 14, 2024, 6:56pm

I should probably opt out of this question as I think I’m just muddying the waters. I can’t make the connection between the linked post, where @paul.buerkner is sampling new levels in a multilevel model (it’s like we have respondents groups into batches for clinical trials, so we can either predict a new member of an existing batch or a whole new batch). That can be used to do this:

The unseen subjects are classified as either being new members of an existing batch or members of a new batch. But I can’t connect that @JLC’s code, which looks like it’s randomizing group effects. Is that because you’re trying to sample a random new person from an existing group and there’s only a single effect to randomize?

JLC · October 14, 2024, 7:01pm

Yes, I am trying to randomly sample an estimated parameter (subject-level intercept) and use it for an unseen subject. That’s how I’m interpreting “uncertainty” vs “gaussian”.

I am trying to simulate additional new people to existing pilot data.

Bob_Carpenter · October 14, 2024, 7:10pm

I’m afraid I don’t know what that means, so I’ll let other people respond. Sorry for all the confusion!

JLC · October 14, 2024, 9:26pm

I’m attempting to implement the default behaviour of an interface in the ‘generated quantities’ block of a Stan model.

My question might be double-barrelled in asking if my understanding, and implementation, of “uncertainty” is accurate.

paul.buerkner · October 15, 2024, 5:59am

I think you are implementing the different methods correctly.

To clarify more generally, brms has different way to simulate/predict new levels in multilevel models. I would recommend you using Gaussian because it is the conceptually easiest and does not require using samples from the old levels. That said, the sample_new_levels = "uncertainty" indeed, for each posterior sample, takes values from a random chosen existing group. This way, the distribution of the group-level coefficients can be approximated without using a Gaussian assumption.

JLC · October 15, 2024, 1:21pm

Thanks for the input, @paul.buerkner !

Topic		Replies	Views
Random draws of data vector Modeling	4	411	March 10, 2023
Varying intercepts with single level group effect Modeling fitting-issues , specification , brms	3	64	December 13, 2024
Sample from a custom distribution in the generated quantiles block Modeling posterior-predictive , rng	3	449	July 1, 2022
Prediction for a new observation in a new group Modeling	2	515	February 4, 2019
NLME model in STAN for missing + unbalanced longitudinal data Modeling rstan , mixed-model , missing-data	20	368	October 21, 2024

Random sample/subset from vector per iteration

Related topics