Adding a varying intercept for mu in beta_proportion model when mu is a vector of parameters

I have a model, as described over here, that is a sort of measurement error model for when the predictor is a proportion. A simple simulation and code are below. Basically, true_proportion is a vector of parameters learned from a beta_proportion model in which the outcome is observed_proportion and kappa is observed_count + 1 (the smaller kappa is, the more uncertainty there is in true_proportion). true_proportion is then used as the predictor in another model (in this case a logistic regression) for my outcome of interest. However, I would like a varying intercept for group in the beta_proportion part of the model, because in my actual application I have repeated observed_proportion values for each group; these should be correlated and should inform the true_proportion parameters, especially since observed_count varies greatly among observations within a group.
Here is the code for the simulated data set up and simple model.

library(rstan)

#define true proportions and generate observed proportions from them, using a beta distribution scaled by observed_count
n <- 100
observed_count <- rnbinom(n, mu=50, size=1) + 10
true_proportion <- runif(n, 0.0000001, 0.9999999)
mu <- true_proportion
shape1 <- mu * (observed_count + 1)
shape2 <- (1 - mu) * (observed_count + 1)
observed_proportion <- rbeta(n, shape1, shape2)
observed_proportion <- ifelse(observed_proportion==0, (observed_proportion + 0.0001), observed_proportion)
observed_proportion <- ifelse(observed_proportion==1, (observed_proportion - 0.0001), observed_proportion)

#generate outcome variable with the true proportion as the predictor
#(note: simulated on the probability scale; the Stan model below uses a logit link, so alpha/beta won't match a_b/b_b)
a_b <- 0.01
b_b <- 0.7
y_p <- a_b + true_proportion*b_b
y <- rbinom(n, 1, y_p)

#data
d1 <- list(N = n,
           y = y, 
           observed_proportion = observed_proportion, 
           observed_count = observed_count)

#stan model
stan_code <- "
data {
  int<lower=0> N;
  vector[N] observed_count;
  vector[N] observed_proportion;
  int<lower=0, upper=1> y[N];
}
parameters {
  vector<lower=0, upper=1>[N] true_proportion;
  real alpha;
  real beta;
}
model {
  observed_proportion ~ beta_proportion(true_proportion, observed_count + 1);
  y ~ bernoulli_logit(alpha + beta * true_proportion);
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);
}
"

#fit model
fit1 <- stan(model_code = stan_code, data = d1,
             chains = 4, iter = 2000, warmup = 1000, 
             thin = 1, cores = 4)

print(fit1)

It seems to work, and it samples well.

Now to add the varying intercept. My initial thought for adding the varying intercept was the following:

#add group to the data
grp <- rep(1:10, each=10)
d2 <- list(N = n,
           y = y, 
           observed_proportion = observed_proportion, 
           observed_count = observed_count,
           grp = grp,
           N_grp = max(grp))

#stan model
stan_code2 <- "
data {
  int<lower=0> N;
  int<lower=1> N_grp;
  vector[N] observed_count;
  vector[N] observed_proportion;
  int<lower=0, upper=1> y[N];
  int<lower=1, upper=N_grp> grp[N];
}
parameters {
  vector<lower=0, upper=1>[N] true_proportion;
  vector<lower=0, upper=1>[N_grp] u;
  real<lower=0> tau_u;
  real alpha;
  real beta;
}
model {
  tau_u ~ normal(0, 0.2);
  u ~ normal(0, tau_u);
  observed_proportion ~ beta_proportion(true_proportion + u[grp], observed_count + 1);
  y ~ bernoulli_logit(alpha + beta * (true_proportion + u[grp]));
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);
}
"

#fit model
fit2 <- stan(model_code = stan_code2, data = d2,
             chains = 4, iter = 2000, warmup = 1000, 
             thin = 1, cores = 4)

print(fit2)

However, this doesn’t sample: for all of the chains I get “does not contain samples”. I am wondering if this is because, while true_proportion has the proper constraints, true_proportion + u[grp] does not. I tried to work around this, but my workarounds have all failed.
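To illustrate my suspicion with toy numbers (not my real data): the shifted mean can leave the (0, 1) interval that beta_proportion requires for mu, which I think is what makes the sampler fail to initialize.

#toy check: true_proportion is in (0,1), but adding a group effect can push mu outside it
set.seed(1)
tp <- runif(10, 0, 1)
u <- rnorm(10, 0, 0.2)
range(tp + u)   #can dip below 0 or exceed 1, which is invalid for beta_proportion's mu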
I am sure this is pretty simple, but I am fairly new to writing Stan code myself. Can anyone help me add a varying intercept to this model? Why won’t it sample as coded? Thanks!

So somehow the act of posting this made me realize a bunch of mistakes in the code above. I had the constraint on u wrong. Also, I guess one can’t have a “varying intercept” if there is no intercept to begin with, so I replaced true_proportion with alpha_p + true_proportion_offset[N].

#stan model
stan_code3 <- "
data {
  int<lower=0> N;
  int<lower=1> N_grp;
  vector[N] observed_count;
  vector[N] observed_proportion;
  int<lower=0, upper=1> y[N];
  int<lower=1, upper=N_grp> grp[N];
}
parameters {
  vector<lower=0, upper=1>[N] true_proportion_offset;
  vector<lower=-1, upper=1>[N_grp] u;
  real<lower=0> tau_u;
  real alpha;
  real beta;
  real<lower=0, upper=1> alpha_p;
}
model {
  tau_u ~ normal(0, 0.2);
  u ~ normal(0, tau_u);
  alpha_p ~ uniform(0, 1);
  observed_proportion ~ beta_proportion(alpha_p + true_proportion_offset[N] + u[grp], observed_count + 1);
  y ~ bernoulli_logit(alpha + beta * (alpha_p + true_proportion_offset[N] + u[grp]));
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);
}
"

#fit model
fit3 <- stan(model_code = stan_code3, data = d2,
             chains = 4, iter = 2000, warmup = 1000, 
             thin = 1, cores = 4)

print(fit3)

But only one chain sampled, and the results look all weird: alpha and beta in the logistic regression are way off from the simple model, and all of the true_proportion_offset estimates sit at 0.50.
So this doesn’t seem to work either.

Hi, before I address the question at hand, let me point out a few potential points of confusion.

  • The variable observed_count in your post seems to be intended to represent the sample size. However, I would intuitively understand observed_count to refer to the number of positive observations in a sample. Your R code to simulate fake data suggests that observed_count is indeed the total sample size; I would advise renaming the variable accordingly.
  • In your simulation code, true_proportion can simply be drawn with bounds at 0 and 1; none of this “0.0000001” business is necessary.
  • The specification of shape1 and shape2 in your simulation code is flawed. Assuming a uniform prior for the true but unknown proportion, see here for the definition of the shape parameters. Note that the “+1” term should be applied after the multiplication.
  • None of the lines with “ifelse” are needed. If zero observations are positive, the true proportion simply follows a beta distribution with shape parameters 1 and [sample size + 1] (still assuming a uniform prior).
  • In your Stan model, you set kappa to observed_count + 1, which means that you are assuming the prior for your unknown true_proportion is a beta distribution with both shape parameters set to 0.5 (instead of 1.0, as suggested here). If, in contrast, you assumed a beta prior with shape parameters of 1.0 and 1.0, this would correspond to kappa = total sample size + 2. After all, kappa = shape1 + shape2 (see the quick check after this list).
  • It is not entirely clear from your description what generative process you have in mind for the observed counts and how these groups of counts arise. Either way, if you want a hierarchical model for how the binomial counts arise, you are better off specifying a hierarchical logistic regression model with normal distributions for the group-level effects. The estimated log-odds of positives for each group level, and for each set of samples within a group, can then simply be translated to a proportion via the inverse-logit transformation.
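To make the kappa point concrete, here is a quick numerical check (all numbers made up): with a Beta(a, b) prior and k positives out of n samples, the posterior for the proportion is Beta(k + a, n - k + b), so kappa = shape1 + shape2 = n + a + b.

#toy illustration of kappa = shape1 + shape2
n <- 20; k <- 7    #total samples and positive samples (made up)
a <- 1; b <- 1     #uniform Beta(1, 1) prior
shape1 <- k + a
shape2 <- n - k + b
kappa <- shape1 + shape2   #n + a + b = 22, i.e. total sample size + 2
mu <- shape1 / kappa       #posterior mean of the proportion

With a + b = 1 (e.g. two shape parameters of 0.5), you would get kappa = n + 1, which is exactly what observed_count + 1 implies.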

Thanks for the response!

This is indeed what it represents. In my application, I want to scale the uncertainty by the sample size, not by the number of positive observations.

For the simulation, I wasn’t aware of the formulation you gave for the beta distribution; I was trying to think of it in terms of mu and phi, as here, and to express phi in some way in terms of the sample size. It may well be flawed haha
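For what it’s worth, my reasoning was: in the mu/phi parameterization, shape1 = mu * phi and shape2 = (1 - mu) * phi, so my sim’s shape1 <- mu * (observed_count + 1) amounts to treating phi as roughly the sample size.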

Maybe it would help if I describe the actual application that I am interested in. I have data where I have samples taken from various sites (many samples at different time points per site). These samples can be either positive or negative. I know the number of samples taken at each site, and this varies greatly both within sites from timepoint to timepoint and between sites. In addition, the proportion of positive samples varies greatly (zero to one) from site to site and time to time. We are interested in how the proportion of samples that are positive predicts a case of a rare disease condition at the site.
The reason I wanted to scale the uncertainty in the proportion predictor by the sample size is that we know small sample sizes carry error: while the samples are supposed to be simple random samples (SRS), small samples very likely represent some sort of convenience sample taken by technicians, and these often contain a high proportion of positives.

So what I wanted was a way to scale the uncertainty in the proportion predictor by the sample size, and then to enter this true_proportion, with its uncertainty, as the predictor in a logistic regression model for my primary outcome of a rare disease case. In addition, I wanted the beta model for true_proportion to treat proportions from the same site as correlated, since I have many observations per site and a small sample at a single timepoint should be informed by the other timepoints at that site.

I recommend that you “just” use a hierarchical logistic regression model with normally distributed random effects to model the true proportions and their correlation within sites. Then you can use the estimated site-level true proportions as a site-level predictor in your model for the rare disease outcome. Note that I’m guessing your outcome will be modeled at the site level (and not per time point per site).

It sounds like you are suggesting some sort of joint model, but I can’t really envision it. How would you model the true proportion with a hierarchical logistic regression? What I have is the observed proportion. Are you saying to model the observed proportion with an intercept and a random effect for site, make per-site predictions for the proportion, and then use these predictions as the true proportion in the logistic regression model for the disease case?

From what I understand, you have more than just the proportion: you have the number of samples taken and the number that were positive (i.e., the actual counts). If that is so, you should be able to define a logistic regression model, correct?

Yes, that is correct. In fact, I have binary data per sample row. So, say a site had 20 samples taken; then I have the 0/1 (neg/pos) result for each sample. So, in lme4 or brms syntax, I can fit something like
sample ~ 1 + (1|site)
in a logistic regression model, which sounds like what you are suggesting. But I am not sure how I would then use the per-site proportion of positive samples, with its uncertainty (let’s call that true_proportion), as a predictor in my outcome model, case ~ 1 + true_proportion, where case is binary and on a per-site, not per-sample, basis. Where would I get true_proportion from?
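(For concreteness, in brms the first model would be something like brm(sample ~ 1 + (1|site), data = d, family = bernoulli()), where d is a placeholder name for the per-sample data frame.)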

As I mostly write my own Stan models, I don’t know how to do this in brms. If there is documentation on it, I would imagine it’s called something like “errors in variables”.

If you want to write your own Stan model, go for it. The model should include a hierarchical logistic model for the sample-level (i) and site-level (j) expected proportion p_{ij}, which would look something like:

logit(p_{ij}) = \beta_0 + \beta_j + \beta_{ij}

Here, the last two terms are, respectively, the site-level deviation from the overall mean and the sample-level deviation from the site-level logit-probability.

You can then use the inverse logit of \beta_0 + \beta_j (or any transformation of it) as a site-level predictor for your outcome of interest. Or even use \beta_0 + \beta_j + \beta_{ij} as a sample-level predictor, if that’s what you’re looking for.
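In Stan, that could look roughly like the sketch below. This is only a sketch with placeholder names (y, site, beta_site, etc.), and I have left out the sample-level term \beta_{ij}, which would just be another deviation vector indexed by a time-point id:

data {
  int<lower=1> N;                 // number of samples
  int<lower=1> J;                 // number of sites
  int<lower=0, upper=1> y[N];     // sample result (0 = negative, 1 = positive)
  int<lower=1, upper=J> site[N];  // site id per sample
}
parameters {
  real beta_0;                    // overall mean log-odds
  vector[J] beta_site;            // site-level deviations
  real<lower=0> sigma_site;
}
model {
  beta_0 ~ normal(0, 5);
  sigma_site ~ normal(0, 2.5);
  beta_site ~ normal(0, sigma_site);
  y ~ bernoulli_logit(beta_0 + beta_site[site]);
}
generated quantities {
  vector[J] p_site = inv_logit(beta_0 + beta_site);  // site-level true proportion
}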


Gotcha! I thought that was along the lines of what you might be suggesting, and I actually tried to write something close to that yesterday. However, I was using the intercept + random_effect[site] parameters directly as my predictor in the other model. To instead use the inverse logit of that, would I put it in the transformed parameters block of the Stan code?

You can, but you certainly don’t have to; you can just apply whatever transformation you like in the model block as well. The only benefit of doing it in the transformed parameters block is that it keeps a copy for you in the output (which takes space and costs I/O overhead). Unless there is a specific reason you need that, I don’t see a real benefit to putting it in the transformed parameters block.
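For example, here is a toy model showing the two placements (placeholder names; beta_0 and beta_site as in my earlier sketch, and the binomial sampling sub-model is omitted to keep it short):

data {
  int<lower=1> J;
  int<lower=0, upper=1> y_c[J];   // one binary outcome per site (toy)
}
parameters {
  real beta_0;
  vector[J] beta_site;
  real alpha_d;
  real b_d;
}
transformed parameters {
  vector[J] p_site = inv_logit(beta_0 + beta_site);  // kept in the output
}
model {
  beta_0 ~ normal(0, 5);
  beta_site ~ normal(0, 1);
  alpha_d ~ normal(0, 5);
  b_d ~ normal(0, 2.5);
  y_c ~ bernoulli_logit(alpha_d + b_d * p_site);
  // or skip the transformed parameters block entirely and write:
  // y_c ~ bernoulli_logit(alpha_d + b_d * inv_logit(beta_0 + beta_site));
}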


Ok, here is my attempt at the Stan code for the joint model. Note that the data sim is not from a generative model; it’s basically just a way to test whether my code runs (I’m new to writing Stan code). It seems to work: at least it runs and the output makes sense.

library(rstan)

#prepare data for Stan model
d3 <- list(M = 100, #number of sites
           N = 4000, #10 sampling obs for each of 100 sites taken at 4 timepoints
           y = rbinom(4000, 1, 0.3), #outcome of sampling
           g = rep(1:100, each=10, times=4), #site id for each sample    
           P = 400, #4 disease outcome observations per site (1 for each timepoint)
           h = rep(1:100, each=4), #site id for each disease outcome
           y_c = rbinom(400, 1, 0.05)) #disease outcome


stan_code4 <- "
data {
    int<lower=1> N; // number of obs, model 1 (sampling data)
    int<lower=1> M; // number of sites
    int<lower=1> P; // number of obs, model 2 (disease outcome data)
    int<lower=0, upper=1> y[N]; // site samples outcome
    int<lower=0, upper=1> y_c[P]; // disease outcome
    int<lower=1, upper=M> g[N]; // site id for sampling outcome data
    int<lower=1, upper=M> h[P]; // site id for disease outcome data
}
parameters {
    real alpha;
    real a[M]; 
    real<lower=0> sigma; 
    real alpha_d;
    real b_d; 
}
model {
  alpha ~ normal(0, 5);
  sigma ~ normal(0, 2.5);
  alpha_d ~ normal(0, 10);
  b_d ~ normal(0, 2.5);
  a ~ normal(0, sigma);
  for(n in 1:N) {
    y[n] ~ bernoulli_logit(alpha + a[g[n]]);
  }
  for(p in 1:P) {
    y_c[p] ~ bernoulli_logit(alpha_d + (inv_logit(alpha + a[h[p]]))*b_d);
  }
}
"

#fit model
fit4 <- stan(model_code = stan_code4, data = d3,
             chains = 4, iter = 2000, warmup = 1000, 
             thin = 1, cores = 4)

print(fit4)

This is probably a dumb question, but how do I know that the correct varying intercept for a given site g from the y[n] ~ bernoulli_logit(alpha + a[g[n]]) model enters for the corresponding site h in the y_c[p] ~ bernoulli_logit(alpha_d + (inv_logit(alpha + a[h[p]]))*b_d) model? The two models have two differently sized datasets, so in the data block I declared g[N] and h[P] as the site ids for the two datasets. But these really encode the same id (i.e. g=10 in one dataset corresponds to h=10 in the other; same site). I just assumed that you couldn’t declare the site id twice with different lengths, like g[N] and g[P], but now I am wondering whether Stan knows that g=h for site.
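To show what I mean about the labels lining up, a quick check on my sim data:

#the two id vectors use the same site labels 1 through 100
g <- rep(1:100, each=10, times=4)   #site id per sample (length 4000)
h <- rep(1:100, each=4)             #site id per disease outcome (length 400)
identical(sort(unique(g)), sort(unique(h)))   #TRUE - same labels in both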