Trouble specifying a simple beta model

Dear Stanners,

I’m trying to specify and test a simple beta model prior to modelling my real data. The (simulated) data describes how transition to psychosis depends on age (younger adults are more likely to transition, however there is also higher variance at this age). The challenge is the data doesn’t contain transition rates (\psi), only outcomes (t = 0, 1). So my model looks like:

t_i \sim Bernoulli(\psi_i)

\psi_i \sim Beta(p_i, q_i)

Where location and precision parameters of Beta are a function of age:
\mu_i = \beta_1age
\phi_i = \beta_2age

The shape parameters with weakly informative priors are:
p_i = \mu_i \phi_i + 2
q_i = \phi_i - \mu_i \phi_i + 2

I think, but I’m not sure, that this might be a latent variable problem. Or it may be more akin a measurement error problem…? Part of my difficulty is that I’m not even sure how to describe my problem so I don’t know the search terms or manual pages to focus on.

Anyway, the simulated data I’m trying to model is generated by the R code below:


library("rstan")

inv.logit <- function(x) {
  1/(1 + exp(-x))
}

set.seed(1)
N <- 100
age <- round(rnorm(N, 30, 10))
b1 <- -0.5
b2 <- 0.05
mu <- inv.logit(b1*age)
phi <- exp(b2*age)
p <- mu * phi + 2
q <- phi - mu * phi + 2
psi <- rbeta(N, p, q)
t <- rbinom(N, size = 1, prob = psi)

And the Stan model I wrote to get the posterior distribution of psi and age effects is here:

data {
  int<lower=0> N;                  // number of Subjects
  int t[N];                        // [0, 1] transition outcome
  real age[N];                     // predictor of psi
}
parameters {
  vector[2] coef;                  // psi regression coefficients
  vector<lower=0, upper=1>[N] psi;
}
transformed parameters {
  vector<lower=0, upper=1>[N] mu;  // location parameter for Beta
  vector<lower=0>[N] phi;          // precision parameter for Beta
  vector<lower=1>[N] p;            // shape parameters for Beta
  vector<lower=1>[N] q;
  for (i in 1:N) {
    mu[i] = inv_logit(coef[1]*age[i]);
    phi[i] = exp(coef[2]*age[i]);
    p[i] = mu[i] * phi[i] + 2;
    q[i] = phi[i] - mu[i] * phi[i] + 2;
  }
}
model {
  coef ~ normal(0, 0.5);           // priors
  psi ~ beta(p, q);                // latent variable
  t ~ bernoulli(psi);              // model
}

However when I try to estimate the model with the code below, I get the subsequent error code:

data_list <- list(
  N = N,
  age = age,
  t = t)

m <- stan(model_code=beta_model, data=data_list)
SAMPLING FOR MODEL '5fabd79078f641126f00eee57af2939c' NOW (CHAIN 1).

Gradient evaluation took 7.9e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.79 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)
[1] "Error in sampler$call_sampler(args_list[[i]]) : "
[2] "  c++ exception (unknown reason)"                
error occurred during calling the sampler; sampling not done

I’ve tried widening and narrowing the priors on the coefs and shape parameters, adding more coefs (e.g., intercepts), but nothing I’ve done gets past this error. I think it’s something to do with how I’m specifying psi, since psi is an intermediate parameter between coef and t. I’m also wondering whether it is numerical underflow (overflow?), since I’m not using logs where I probably should be. Any help or guidance you can provide would be much appreciated to this motivated but mathematically challenged Stan learner.

Rich

I just realized it may have something to do with the fact I’m not specifying target += anywhere.

Changing the model code to

model {
  coef ~ normal(0, 0.5);                       // priors
  for (i in 1:N) {
    psi[i] ~ beta(p[i], q[i]);                 // latent variable
    target += bernoulli_lpmf(t[i] | psi[i]);   // model
  }
}

gets me a bit further along the estimation, but it still falls over on the 4th chain with the same error message.

Check the output of rbinom(). I’m not at my computer so I can’t check, but I don’t think it has the dimension you expect.

thanks for your suggestion. As far as I can tell rbinom is giving me what I expect: A vector of ones and zeros, of length N. Happy to hear any other suggestions…

If you’re using 2.18, sorry that the error messages are getting swallowed. We’re about to roll out a patch.

It’s easiest for us to help with more information, which would be the full sequence of commands you issued to R.

It’s not about target += — the sampling statements (~) are just syntactic shorthand for target +=.

It’s easier to see the symmetry by defining q_i = (1 - \mu_i) \phi_i + 2, but more efficient to implement as q_i = \phi_i - p_i.

This model isn’t going to be easy to fit—you have lots of degrees of freedom around every data point and the Beta distribution is tricky to fit given that they’re precision-like, rather than variance-like (meaning they tend to drift up without bound, making adaptation hard for samplers).

Bernoulli probabilities closer to 0.5 have more variance, where ones near the edge are low variance. I wasn’t quite sure what you meant by higher variance at a younger age. The compounds beta-bernoulli here will still reduce to a single probability for each age, it just gets there through a more convoluted path.

It also looks like your program is running very slowly for N = 100.

I would’ve started just trying to fit a simple linear regression on age and then look at the error pattern to see if it needs non-linear effects to capture the proper trend.

To debug, you can add print() statements to see if things are underflowing or overflowing, but I fear RStan 2.18.1 may be swallowing those, too. One problem you can run into is if the data are clearly separable so that either p or q wants to run off to infinity. But from the formulation of the model, the posterior should be proper here as far as I can tell.

The specification of psi looks OK. The efficient way to implement this is to marginalize out that psi using the beta-binomial (there’s also a beta-Bernoulli, but we don’t implement that—you’d have to use beta-binomial and just set N = 1). By definition, y \sim \mathsf{BetaBinomial}(y | N, \alpha, \beta) is equivalent to \theta \sim \mathsf{Beta}(\alpha, \beta) and y \sim \mathsf{Binomial}(N, \theta).

Are you sure all those ages you are generating are positive? R uses the scale parameterization of normal, so there’s a 2.5% chance or so that you’ll get a variate below 0, as it’s only two standard deviations below the mean.

Anything beyond +/-10 is very extreme for inverse logit and might as well be infinity.

Basically, make sure your data makes sense before sending it into Stan.

Thank you for your considered comments Bob. I updated the r code in my original question to provide the complete set of commands needed to run the code. I’m running rstan (Version 2.17.4, GitRev: 2e1f913d3ca3), so I guess it isn’t a 2.18 problem.

Some of your queries about the (simulated) data can be answered by a scatter plot of age by psi, which I’ve included below. The plot shows the age values range from 8 to 54, and the variance of psi (where psi is the risk/probability of psychosis) is greater at younger ages than older ages.

I appreciate your point about using a simple linear regression as a starting point, and I will certainly go back and do that and examine the errors in the original dataset I’m trying to model. But I also think the data generation process I’m dealing with looks more like the model I’m trying to specify here. That is, transition to psychosis is a function of an underlying (latent) variable representing risk (psi), which is itself related to age (and symptom levels, and potentially other factors which will ultimately be included in the model as well).

In light of your comments about specifying the model, I tested the following model code, but got the same error:

model{
  coef ~ normal(0, 0.5);                     // priors
  for (i in 1:N) {
    target += beta_binomial_lpmf(t[i] | 1, p[i], q[i]);
  }
}
2 warnings generated.
ld: warning: text-based stub file /System/Library/Frameworks//CoreFoundation.framework/CoreFoundation.tbd and library file /System/Library/Frameworks//CoreFoundation.framework/CoreFoundation are out of sync. Falling back to library file for linking.

SAMPLING FOR MODEL 'd21d7684b79665a9ce5c934471e70e01' NOW (CHAIN 1).

Gradient evaluation took 0.000105 seconds
1000 transitions using 10 leapfrog steps per transition would take 1.05 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)
[1] "Error in sampler$call_sampler(args_list[[i]]) : "
[2] "  c++ exception (unknown reason)"                
error occurred during calling the sampler; sampling not done

Plot of age by psi

plot(age, psi)

Correction: When I tried a simpler model, by replacing \phi_i with \phi, it works! (with some divergence warnings):

data {
  int<lower=0> N;                  // number of Subjects
  int<lower=0, upper=1> t[N];      // [0, 1] transition outcome
  real age[N];                     // predictor
}
parameters {
  real coef;                       // regression coefficient
  real<lower = 0>  phi;
}
transformed parameters {
  vector<lower=0, upper=1>[N] mu;  // location parameter for Beta
  vector[N] p;                     // shape parameters for Beta
  vector[N] q;

  for (i in 1:N) {
    mu[i] = inv_logit(coef*age[i]);
  }
  p = mu * phi + 2;
  q = phi - mu * phi + 2;
}
model {
  coef ~ normal(0, 0.5);                     // priors
  t ~ beta_binomial(1, p, q);
}'

Thank you for your help and suggestions. I think I can begin to build on this now…

This may be something you want to build on, but if that first beta-binomial parameter remains 1, this reduces to

t ~ bernoulli(p / (p + q))

The density is (replacing p with a and b with q for simplicty)

p(t \mid a, b) = \begin{cases} B(a + 1, b) / B(a , b) & \mathrm{if} \ t = 1 \\[4pt] B(a, b + 1) / B(a, b) & \mathrm{if} \ t = 0 \end{cases}

which using

B(a + 1, b) = B(a, b) \cdot \frac{a}{a + b}

and

B(a, b + 1) = B(a, b) \cdot \frac{b}{a + b}

reduces to

p(t \mid a, b) = \begin{cases} a/(a + b) & \mathrm{if} \ t = 1 \\ b/(a + b) & \mathrm{if} \ t = 0 \end{cases}

so that p(t \mid a, b) = \mathsf{Bernoulli}(t \mid a / (a + b)).

1 Like

Thanks Bob. If I understand you correctly, you are saying my model reduces to a much simpler model where t is simply a function of the proportion of a. (I’m assuming your nomenclature is implying that a and b are the usual Beta shape parameters for successes and failures?). …anyway, I think the model simplification you point out is probably still consistent with my hypothesized data generation process, but I’ll think more on it so thank you very much for your thoughts.

This would also let you get rid of inv_logit in the specification of mu.

Also, that loop cna be vectorised, so it’s

mu = inv_logit(coef[1] * age);
...
p[i] = mu .* phi + 2;
...

The .* is elementwise multiplication whereas regular * is the usual matrix arithmetic.