Refit model with generated data

I am attempting to conduct a futility analysis and have found that the best approach is to utilize the predictive distribution.

A futility analysis involves stopping a study early if the probability of rejecting the null hypothesis by the end of the study is very low. The test is based on the probability of direction: we reject the null hypothesis if more than 95% of the parameter’s posterior density lies above 0. The decision has to take into account how much data is still left to collect.
For the binomial case, the process is quite straightforward: we only need the predictive distribution of the number of future successes and the threshold at which the null hypothesis would be rejected. A practical example is explained in The utility of Bayesian predictive probabilities for interim monitoring of clinical trials (PMC).
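For reference, this is roughly the quantity computed in that binomial example; the Beta(a, b) prior and the threshold $\theta_0$ are my notation, not taken from the paper:

$$
\mathrm{PPoS} \;=\; \sum_{x=0}^{m} p(x \mid y)\,
\mathbb{I}\!\left[\Pr(\theta > \theta_0 \mid y, x) > 0.95\right],
$$

where $y$ is the number of successes among the $n$ patients observed so far, $m$ is the number of patients still to be enrolled, $p(x \mid y)$ is the beta-binomial predictive distribution of the future successes under the Beta$(a, b)$ prior, and the posterior inside the indicator is Beta$(a + y + x,\; b + n + m - y - x)$.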

However, this method does not generalize to other distributions, since we cannot sum up the responses across individuals as we can in the binomial case. This is particularly relevant for the negative binomial distribution, which is the one I am interested in.

I have developed a script that performs this analysis, but it is so inefficient that it’s practically unusable.

In the first part, it fits the model to the data collected so far and, for each posterior draw, generates responses for the sample that is not yet available. That is, we generate ‘ndraw’ replicated versions of the missing observations.

In the second part, it merges the existing sample with each generated one, refits the model, and tests the null hypothesis. The issue is that, because I haven’t found a way to combine the two steps into a single fit, the model has to be refitted for each of the ‘ndraw’ replicated datasets, which adds up to millions of MCMC iterations. Is there a more efficient way to accomplish this? Thanks in advance
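Written out, the quantity the two parts approximate together is (my notation, with $\beta$ the parameter being tested and $\tilde{y}$ the responses still to be collected):

$$
\widehat{\mathrm{PPoS}} \;=\; \frac{1}{\text{ndraw}} \sum_{d=1}^{\text{ndraw}}
\mathbb{I}\!\left[\Pr\!\left(\beta > 0 \mid y_{\text{obs}}, \tilde{y}^{(d)}\right) > 0.95\right],
\qquad \tilde{y}^{(d)} \sim p(\tilde{y} \mid y_{\text{obs}}),
$$

and it is the inner probability, $\Pr(\beta > 0 \mid y_{\text{obs}}, \tilde{y}^{(d)})$, that currently requires a full refit for every $d$.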

you might be able to do this in Stan by using the generated quantities block to create replicated datasets.
see the examples in this case study: Hierarchical Partial Pooling for Repeated Binary Trials, sections “Multiple Comparisons” and “Ranking”
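a minimal sketch of that idea for an intercept-only negative binomial model (the variable names, priors, and modern array syntax here are placeholders/assumptions, not taken from your script):

```stan
data {
  int<lower=0> N_obs;                    // responses collected so far
  array[N_obs] int<lower=0> y_obs;
  int<lower=0> N_future;                 // responses still to be collected
}
parameters {
  real mu_log;                           // log mean of the negative binomial
  real<lower=0> phi;                     // overdispersion
}
model {
  mu_log ~ normal(0, 5);                 // placeholder priors
  phi ~ exponential(1);
  y_obs ~ neg_binomial_2_log(mu_log, phi);
}
generated quantities {
  // one replicated "rest of the study" per posterior draw
  array[N_future] int y_future;
  for (n in 1:N_future)
    y_future[n] = neg_binomial_2_log_rng(mu_log, phi);
}
```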

I have created replicated datasets for every posterior draw of the parameter obtained in the regression. However, that is a lot of replicated datasets: each of them, merged with the current data, is used to obtain a new posterior for the parameter and to check the hypothesis. As you can see, running a model fit for every replicated dataset is very inefficient (thousands of model fits).

perhaps I’m missing something here - once you’ve fit the model to the existing sample, your posterior plays the role of the prior and the generated data is just new data, so you can keep going in the generated quantities block and compute the updated parameters given the new data, following the strategy outlined in the Stan User’s Guide chapter Held-Out Evaluation and Cross-Validation.
you can also code the hypothesis test in the generated quantities block.
furthermore, if you don’t really care about the new data, just the outcome of the hypothesis test, you can compute the new data in a local block so that it doesn’t get written to the output CSV file.
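here’s a minimal sketch of the local-block idea, reusing the data and parameter names from the sketch above; it only keeps a summary of the replicated data - the posterior update on the old + new data is not shown here:

```stan
generated quantities {
  int y_future_total;            // only this summary is written to the output CSV
  {
    // local block: y_future exists only inside these braces, so the
    // replicated dataset itself never reaches the output file
    array[N_future] int y_future;
    for (n in 1:N_future)
      y_future[n] = neg_binomial_2_log_rng(mu_log, phi);
    y_future_total = sum(y_future);
  }
}
```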

Those pages only show how to evaluate the new data under the already-estimated likelihood, not how to recompute the beta coefficient from the old and new data together.

I want to recreate what the paper Sample Size Reestimation by Bayesian Prediction (Ming-Dauh Wang) does in section 4.1, the unknown-normal-parameters case, but generalise it to other distributions and priors (specifically, the negative binomial).
Additionally, instead of solving for the required sample size, I would like to obtain the estimated power (I am interested in both situations).

In the end, what I want is the probability that we will reject the null hypothesis at the end of the study, conditioned on the prior and on the data collected so far. To get that, we need the predictive distribution of the future data, and then we have to push that density through the hypothesis test.
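In symbols, the target is (same notation as above):

$$
\Pr(\text{reject at the end} \mid y_{\text{obs}})
\;=\; \int \mathbb{I}\!\left[\Pr(\beta > 0 \mid y_{\text{obs}}, \tilde{y}) > 0.95\right]
p(\tilde{y} \mid y_{\text{obs}})\, d\tilde{y},
$$

where $p(\tilde{y} \mid y_{\text{obs}}) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y_{\text{obs}})\, d\theta$ is the posterior predictive distribution of the future data.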

in the generated quantities block, you would assign the log probability density to a variable by calling, e.g., normal_lpdf - see the example of assigning to a variable in the User’s Guide section Conventions for Probability Functions.
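for example, in a toy normal model (the names here are illustrative, not from your model):

```stan
data {
  int<lower=1> N;
  vector[N] y;
  real y_new;                  // a single held-out / future observation (illustrative)
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);       // no explicit priors: improper flat priors implied
}
generated quantities {
  // log density of y_new under this draw's parameters, stored as a variable
  real log_p_y_new = normal_lpdf(y_new | mu, sigma);
}
```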

on further consideration, I don’t think that this is going to tell you anything other than the relationship between the number of observations and the variance - you’re generating more data from your current estimate, so the new data will have the same distribution as what you’ve seen already, but with added uncertainty. the less data you see, the greater the uncertainty. you’re doing a very indirect demonstration of $\frac{1}{\sqrt{N}}$.