Accelerating models with sufficient stats: adjustment necessary when some conditions don't have repetitions?

(Note: edited to remove reference to hierarchical models specifically)

So the manual shows how models with repeated binary outcomes can have the likelihood part of the computation accelerated through the use of sufficient stats, and we’ve worked out the same acceleration for repeated Gaussian outcomes.

These strategies work great when repeated observations are ubiquitous at the level of the likelihood (e.g., in a hierarchy with schools as groups and students as observations there are multiple students per school; in a hierarchy with survey respondents as groups and survey items as observations there are multiple items per respondent; etc.). However, it’s often the case that the data contain mostly repeated observations per group, but some groups have only one observation. In those cases I’ve thus far been reverting to the non-accelerated “observation-level” computation of the likelihood, assuming that if I naively incremented the target using sufficient stats for the groups with repetitions and then separately incremented the target using the observation-level likelihood for the rest, the result would not be appropriate (possibly because it gives higher weight to one kind of increment than the other?).

So I guess the first question is: Is it indeed inappropriate to mix sufficient-stats increments and observation-level increments in a model? To be concrete, this would be the minimal binomial version:

data{

    int<lower=0> n_obs_groups_with_reps ; // number of observation groups with repetitions
    array[n_obs_groups_with_reps] int<lower=2> total_observations_per_group ;
    array[n_obs_groups_with_reps] int<lower=0> total_ones_per_group ;

    int<lower=0> n_obs_groups_without_reps ; // number of observation groups without repetitions
    array[n_obs_groups_without_reps] int<lower=0, upper=1> single_observation_per_group ;

}
parameters{

    real<lower=0, upper=1> theta;

}
model{

    // sufficient-stats increment for those with reps:
    total_ones_per_group ~ binomial( total_observations_per_group , theta ) ;

    // observation-level increment for those without reps:
    single_observation_per_group ~ bernoulli( theta ) ;

}

And for the Gaussian version it would be:

data{

    int<lower=0> n_obs_groups_with_reps ; // number of observation groups with repetitions
    vector<lower=2>[n_obs_groups_with_reps] number_of_observations_per_group ; // per-group counts (reals so the vectorized arithmetic below works)
    vector[n_obs_groups_with_reps] mean_of_observations_per_group ; // per-group sample means
    vector<lower=0>[n_obs_groups_with_reps] variance_of_observations_per_group ; // per-group sample variances (n-1 denominator)

    int<lower=0> n_obs_groups_without_reps ; // number of observation groups without repetitions
    vector[n_obs_groups_without_reps] single_observation_per_group ;

}
transformed data{

    vector[n_obs_groups_with_reps] sqrt_n = sqrt(number_of_observations_per_group) ;
    // gamma shape for the sampling distribution of each group's sample variance
    vector[n_obs_groups_with_reps] sd_gamma_shape = (number_of_observations_per_group - 1.0) / 2.0 ;

}
parameters{

    real mu;
    real<lower=0> sigma;

}
model{

    mu ~ std_normal() ;
    sigma ~ std_normal() ;

    // sufficient-stats increment for those with reps:
    // sample means: normal with sd scaled by sqrt(n)
    mean_of_observations_per_group ~ normal( mu, sigma ./ sqrt_n ) ;
    // sample variances: gamma with shape (n-1)/2 and rate (n-1)/(2*sigma^2)
    variance_of_observations_per_group ~ gamma(
        sd_gamma_shape
        , sd_gamma_shape ./ pow(sigma,2)
    ) ;

    // observation-level increment for those without reps:
    single_observation_per_group ~ normal( mu, sigma ) ;

}
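
(For reference, and not something spelled out above: the transformed-data quantities encode the standard normal sufficient-stats factorization, assuming variance_of_observations_per_group holds the usual sample variances computed with the $n-1$ denominator. For a group of $n$ iid observations,

$$
\bar{y} \sim \mathrm{Normal}\!\left(\mu,\ \frac{\sigma}{\sqrt{n}}\right),
\qquad
s^2 \sim \mathrm{Gamma}\!\left(\frac{n-1}{2},\ \frac{n-1}{2\sigma^2}\right),
$$

and the product of these two densities differs from the likelihood of the raw observations only by a factor that involves neither $\mu$ nor $\sigma$.)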

If that naive approach is indeed incorrect, presumably there’s some correction factor that needs to be applied to account for the mixture of increment kinds; any idea what that correction might be? Maybe something like multiplying each increment of the sufficient stats kind by the number of observations that were collapsed to a single stat for that increment?

You shouldn’t need any special correction here. Formally, the sufficient-stats form works precisely because it differs from the disaggregated likelihood only by a multiplicative constant (constant with respect to the parameters). For example, the binomial likelihood of a sufficient statistic differs from the product of the disaggregated Bernoulli likelihoods only by a factor of the binomial coefficient (see 16.1 Binomial distribution | Stan Functions Reference). In fact, I imagine that Stan’s binomial_lupmf drops this constant binomial coefficient from the likelihood and returns literally the log of the product of the Bernoullis.
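
If it helps to see that constant concretely, here’s a quick sketch (my own illustration, not from the docs; the data names are arbitrary) that evaluates both forms at a fixed theta passed in as data. It has no parameters, so it can be compiled and run with the fixed_param algorithm just to see the prints:

data{

    int<lower=1> n ;                        // number of observations in one group
    array[n] int<lower=0, upper=1> y ;      // the disaggregated binary outcomes
    real<lower=0, upper=1> theta_check ;    // arbitrary value of theta to evaluate at

}
transformed data{

    int k = sum(y) ;                        // the sufficient statistic
    real lp_binomial = binomial_lpmf(k | n, theta_check) ;
    real lp_bernoulli = bernoulli_lpmf(y | theta_check) ;  // vectorized: sums over y
    // the difference is the log binomial coefficient, which involves only the data:
    print("binomial - sum(bernoulli) = ", lp_binomial - lp_bernoulli) ;
    print("log(choose(n, k))         = ", lchoose(n, k)) ;

}

The printed difference equals lchoose(n, k), which involves only the data, so it is constant across the whole posterior and contributes nothing to the gradients; that’s why it can’t change what MCMC does.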

To build intuition, note that the more trials you have in a binomial distribution, the more tightly the likelihood concentrates around the observed proportion. So the binomial distribution is already weighting conditions by the number of rows they subsume, and in just the right way. This is why using sufficient statistics does not require perfectly balanced conditions (the same number of rows contributing to every sufficient stat). Given that sufficient stats let us combine conditions with different numbers of rows in the disaggregated data (say 2, 3, and 10 rows), perhaps it makes sense that nothing special breaks when we construct a sufficient stat for just 1 row. Indeed, a binomial sufficient statistic for a single row is a binomial with one trial, which is just the Bernoulli. Likewise in the Gaussian case, the sufficient stat for the mean with just a single observation is just the regular normal likelihood.
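
To spell out the single-row cases (just the standard densities, nothing specific to this model): with $n = 1$ we have $k = y$ and $\bar{y} = y$, so

$$
\mathrm{Binomial}(y \mid 1, \theta) = \theta^{y}(1-\theta)^{1-y} = \mathrm{Bernoulli}(y \mid \theta),
\qquad
\mathrm{Normal}\!\left(\bar{y} \,\middle|\, \mu, \frac{\sigma}{\sqrt{1}}\right) = \mathrm{Normal}(y \mid \mu, \sigma),
$$

and there is no variance increment at all for a singleton group (the gamma shape $(n-1)/2$ would be zero), which is exactly what dropping to the observation-level statement does.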

Ah, perfect! I’m usually burned by failing to appreciate the complexities of the MCMC world, so it’s nice to be caught out erring in the other direction 😛