I am trying to figure out whether my problem can be solved with rstanarm MRP. I am using multilevel hierarchical regression, where I need to estimate population slopes and intercepts that will be used to estimate relative potency at a specific dose level. Since replicates within cells are unequal, I would also need design weights. The data could be considered a hurdle model, in that there are a lot of zeroes (not coming from the positive domain). The proportions of zeroes in each cell are only estimates and thus have variability. Response data need to be multiplied by the proportion of nonzeroes, which in turn affects the weights. On top of this I need to transform the response data by square root. Since there is some population structure to the zeroes, I could develop a weakly informative prior for them; each primary hierarchical unit will have its own binomial proportion of zeroes. My question is whether rstanarm MRP could help me with this problem. I am currently considering a combination of nonparametric and parametric bootstrapping: for each bootstrap replicate, I would grab a posterior estimate of the nonzero probability from the beta-binomial and then solve for the design weights. Unfortunately I cannot share more specific information about this problem, but hopefully this provides enough description to determine whether rstanarm could help. Thanks for considering my problem.

The rstanarm package is only going to help you if the model you want to estimate is among the ones that are implemented in rstanarm. That is mostly separate from the question of whether Multilevel Regression and Poststratification (MRP) can help you make inferences or predictions about a population that differs from the data you condition on. MRP is likely to be helpful regardless of whether you use rstanarm to do the multilevel regression part of it.

Hi Ben,

Thanks for the response. I am using normal, binomial, and beta distributions, but that was not my question. Data values will need to be multiplied by case weights, which are random variables. In addition, I will need to use design weights (also random variables), something like this in the case of one hierarchical variable: 1/(sigma2_alpha + sigma2_res/n_j). I will be estimating slopes and intercepts for two independent samples using the same two dose levels, and will then do something else with the slopes and intercepts.
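For concreteness, that design weight can be written as a small R function — a sketch only, where sigma2_alpha, sigma2_res, and n_j are the between-group variance, residual variance, and cell size from the formula above:

```r
# Inverse-variance design weight for the mean of cell j under a one-way
# random-effects model: w_j = 1 / (sigma2_alpha + sigma2_res / n_j).
# In this setting both variance components are themselves estimates,
# which is why the weights are random variables.
design_weight <- function(sigma2_alpha, sigma2_res, n_j) {
  1 / (sigma2_alpha + sigma2_res / n_j)
}

# Example with made-up variance components: larger cells get larger weights
design_weight(sigma2_alpha = 2, sigma2_res = 4, n_j = c(2, 8))
```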

My question is whether rstanarm can handle this kind of problem. The design weights and case weights depend upon one another. I hope that I have clarified my question.

Thanks

-Brian

I know there is a `weights` argument built into rstanarm, as in stan_glm(…, weights = …), but I am not sure that is what you need. Otherwise, brms also supports weights.
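For what it's worth, with fixed integer weights that argument behaves like frequency weights, i.e. like duplicating rows. A base-R sketch of the equivalence, using `lm` as a stand-in since the same weighting logic applies to the likelihood:

```r
# Integer "frequency" weights scale each row's contribution to the fit,
# which is equivalent (for point estimates) to duplicating the rows.
d <- data.frame(y = c(1, 2, 4), x = c(0, 1, 2))
w <- c(1, 2, 3)

fit_w   <- lm(y ~ x, data = d, weights = w)
fit_dup <- lm(y ~ x, data = d[rep(seq_len(nrow(d)), w), ])

all.equal(coef(fit_w), coef(fit_dup))  # same coefficients either way
```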

It’s case weights that I really need, and in my problem they are random. After multiplying the data by case weights, I do the multilevel modeling. The only way that I can see to do this is with bootstrapping combined with a parametric bootstrap for the case weights. It would be really helpful to know whether there is a way to do this with either rstanarm or brms. Adding in modeling weights would be another level, but I would be willing to see how it works with only case weights initially.

There is a `weights` argument. You can pass whatever you want to it, but it is fixed.

My concern is whether the things you are proposing to do yield valid inferences in any framework.

Case weights are fundamentally Frequentist, in the sense that multiplying by them yields a normal sampling distribution for the mean across datasets that could have been randomly sampled from a population whose numbers of each type of person are known. Multiplying by them yields invalid Bayesian inferences because you do not know that those people in the population would have had the same outcome as the corresponding person in the sample. Thus, when you condition on the data and scale the likelihood up by the number of people like that in the population, you get a posterior distribution that is way too concentrated.
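A minimal numeric illustration of that over-concentration, with assumed numbers (normal likelihood with known sd and a flat prior):

```r
# Posterior sd of a normal mean is 1/sqrt(n_effective) (known sd = 1,
# flat prior). Case weights inflate n_effective from the 25 outcomes
# actually observed to sum(w), so the posterior looks far too precise.
n <- 25
w <- rep(40, n)                   # weights scaling the sample up to a population
sd_honest   <- 1 / sqrt(n)        # conditions on the observed outcomes: 0.2
sd_weighted <- 1 / sqrt(sum(w))   # pretends 1000 outcomes were observed: ~0.03
c(sd_honest, sd_weighted)
```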

Similarly, anything with parametric or nonparametric bootstrapping is intended to yield a correct sampling distribution of some estimator across datasets. Finding a distribution for the parameters conditional on resampled datasets that are slightly different each time and never exactly observed does not yield a valid posterior distribution for the parameters conditional on observed data. That said, these days something like that can be used to check whether there are problems with your model.

Conversely, poststratification is intended to map a posterior distribution conditional on the raw data that you observe to a different population where the share of each type of person is known. In other words, all the reweighting takes place after a valid posterior distribution has been obtained. There is a vignette about MRP in conjunction with rstanarm at

But it does not resemble anything that you are proposing.
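To illustrate the order of operations: in MRP, poststratification is just a weighted average of posterior draws, applied after fitting. A self-contained sketch with made-up cell means and known population shares:

```r
set.seed(1)
# Posterior draws of the mean outcome in each of 3 population cells
# (made-up numbers standing in for output of a fitted multilevel model)
draws <- cbind(rnorm(4000, 1), rnorm(4000, 2), rnorm(4000, 5))
share <- c(0.5, 0.3, 0.2)        # known population share of each cell

# Reweight each posterior draw, not the data: one population-level
# draw per posterior draw, so uncertainty propagates automatically
pop_draws <- draws %*% share
mean(pop_draws)                   # close to 0.5*1 + 0.3*2 + 0.2*5 = 2.1
```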

Let me try to explain more clearly. We are working with gene expression data from sliced-up organs from animals, but expression is not homogeneous. For each organ, we only have a subset of the sliced sections. A portion of the data are zeroes, indicating no expression. Nevertheless, measurements should correspond to the density of a transduction area, which the complete collection of slices covers. If, for a specific organ, 50% of the data are zeroes, that would mean that our estimate of density based on non-zeroes should be multiplied by 0.5. However, that 50% is a random variable, as we do not have data for the complete set of sections. The design is hierarchical, with two different sample types, test and reference, and two dose levels (original viral injection into the organ). The goal is to estimate the relative potency of test relative to reference at just the high dose. For that we need slopes and intercepts. The data are hierarchical: within each dose level within each sample type, there are multiple organs, sections within organs, and sections placed across multiple slides. So there is variability across organs, variability among sections within slides, and variability among slides within organs.

I have tried looking at various vignettes and perhaps did not understand how they could be applied to my problem. I reached out to Stan development to see if I am missing something. So plan B is a bootstrap with case weights. I can construct a beta-binomial posterior for each organ's proportion of nonzeroes for generating random case weights. As there is some structure to the pattern of zeroes, my prior can be made weakly informative. So with each bootstrap, a random case weight for each organ would be generated. I was planning to bootstrap only at the highest hierarchical level (as per research in this area).
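Under that plan, one replicate might look like the following sketch; the counts, prior values, and resampling scheme are all assumptions for illustration:

```r
set.seed(42)
# Hypothetical organ-level counts: sections assayed and nonzero sections
organs <- data.frame(id = 1:6, n_sections = 10,
                     n_nonzero = c(5, 7, 4, 9, 6, 8))
a0 <- 2; b0 <- 2   # weakly informative Beta prior on the nonzero proportion

n_boot <- 200
boot_weights <- replicate(n_boot, {
  # Nonparametric part: resample organs, the highest hierarchical level
  samp <- organs[sample(nrow(organs), replace = TRUE), ]
  # Parametric part: draw each organ's nonzero proportion from its
  # Beta(a0 + nonzero, b0 + zero) posterior and use it as a case weight
  rbeta(nrow(samp),
        a0 + samp$n_nonzero,
        b0 + samp$n_sections - samp$n_nonzero)
})
dim(boot_weights)   # one column of organ case weights per replicate
```

Each replicate would then multiply the transformed responses by the drawn weights and refit the multilevel model.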

My first choice would be to do the problem in a completely Bayesian format, but at this point I do not see how. If you could point me to a vignette that would show me how to do my problem within a completely Bayesian framework, I would be extremely grateful. If you see a flaw in the bootstrap proposal, I would also be grateful if you shared it.

Thanks for considering my problem.

Would you be able to share a snippet of the data or some simulated data? I think that might help.

OK, I can do that. Is there any possibility of a phone conversation, where I could make everything clear? I am nervous about putting too much info on the internet. One phone conversation would make everything super-clear. Please let me know if that would be possible. Otherwise I will simulate some sample data.

Simulated data would be best! That way everyone knows what assumptions are being made about the data. Or, failing that, is there a publicly available data set that looks the same?

This is a highly unusual problem complicated by zeroes. I am writing up something now. The key is to understand how the zeroes are used.

Here is the structure of the data set. Slides are nested within organs, and sections are nested within slides. Each organ appears in only a single dose and sample, so samples across sample types and dose levels are independent. The big sources of variability are organs and sections. There is only one measurement per section. Originally there were approximately k sections per organ. All were placed onto slides, but only two sections were selected from each slide to assay. What is key, and what seems to be hard to get across, is that the original k sections encompass an area, which is approximately the same for every organ. It is only the nonzeroes that have meaning. However, the response variable needs to be a density measurement of the original area. If half of the original measurements were zeroes, not accounting for that would produce biased density measurements. The proportion of zeroes in any organ is only an estimate of the original zero proportion of the area. What I am really after are the slopes and intercepts for test and reference, from which relative potency at the high dose can be estimated. So the response variable that I will use in the analysis is Z = T(response_var) * case_weight, where case_weight is a random variable with its own posterior distribution. The slopes and intercepts must be based on Z, not T(response_var).
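A sketch of that response construction with made-up values, taking T() to be the square-root transform mentioned earlier:

```r
# Z = T(response_var) * case_weight, where case_weight is a draw of the
# organ's nonzero proportion (values below are made up for illustration)
response_var <- c(123456789, 45678, 0, 2500)
case_weight  <- c(0.5, 0.5, 0.8, 0.8)

Z <- sqrt(response_var) * case_weight
Z[4]   # sqrt(2500) * 0.8 = 40
```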

I have been using SAS MCMC for Bayesian analysis, and there is no way to do this in that program. I am not sure whether brms or rstanarm can help me, but it would be great if they could. My problem with the vignettes is that they are all couched in a sample-survey context, which is not my context. I am looking for something like my problem, where regression coefficients need to be estimated but with case weights applied. We are all in agreement that I cannot fix the case weights.

The only approach that I have been able to come up with is bootstrapping, with a posterior calculated for the nonzero proportion of each organ. So for each bootstrap, I would generate a random case weight for each observation using a beta-binomial posterior. I would be grateful if you could point me to a vignette that would explain how to do my problem in a pure Bayesian model. I would also be grateful if you could point out any flaws in my bootstrap proposal.

I hope that this helps.

Response_var  sample  dose  organ  slide  section
123456789     test    1     az1    1      a1
45678         test    1     az1    1      b1
.
.
.
0             ref.    3     bq     4      q4

Could you post a csv of the fake data? It’s a bit hard to read that format. Thanks for clarifying the structure.