Basic model - Estimate probit model parameters and level height + noise simultaneously

As a true beginner to Stan I’ve been working on a very basic probit model. It works well; please see the R code and Stan code below.

However, I want to extend the model so that [0,1] is a latent state reflecting the probability of reaching a level with a specific height. Besides estimating the parameters of the probit model, I also want to estimate the height of the level (3 +/- measurement noise in the simulated example below).

How should I extend the current Stan model to efficiently add the “level_height” and “noise” terms?

Thank you very much! Thomas

R code:

# Number of attempts
N <- 300
# Input parameters
alpha <- 10
beta <- 1.5
level_height <- 3

# Random noise
e <- 0.1
noise <- e * runif(N) - e / 2
# Simulation of region where level becomes activated
x_val <- alpha + (6 * beta * runif(N) - 3 * beta)

# Active or inactive level
# Only 0's or 1's
y_act <- pnorm(x_val, alpha, beta) > runif(N)

# Add height of level
# Includes level height + noise
y_level <- y_act * level_height + noise

Stan code (works well for y_act):

data {
  int<lower=1> N;
  vector[N] x;
  int<lower=0, upper=1> y[N];
}
parameters {
  real<lower=0> alpha;
  real<lower=0.1, upper=10> beta;
}
model {
  vector[N] eta;
  eta = 1 / beta * (x - alpha);
  alpha ~ normal(0, 10);
  beta ~ cauchy(0, 10);
  for (i in 1:N) {
    y[i] ~ bernoulli(Phi(eta[i]));
  }
}

Stan code (starting point for “y_level”):

data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real<lower=0> alpha;
  real<lower=0.1, upper=10> beta;
  real<lower=0.1, upper=10> levelheight;
  real<lower=0.01> noise;
}
model {
  vector[N] eta;
  eta = 1 / beta * (x - alpha);

  alpha ~ normal(0, 10);
  beta ~ cauchy(0, 10);
  levelheight ~ normal(0, 10);
  noise ~ cauchy(0, 5);

  for (i in 1:N) {
    y[i] ~ bernoulli(Phi(eta[i]));
  }
}

Hi, sorry, your question fell through despite being relevant and well written.

Generally Stan does not support discrete unobserved parameters, but in this case this is not a big limitation. One fruitful way to think about the model is that it is a mixture of two distributions - one with mean 0 and the other with mean level_height. I would be surprised if uniformly distributed noise, as you have in the simulation, would actually capture your data well, but maybe you have a reason to believe this is the case? It could also cause problems: note that with such noise there will be many observations where you know exactly whether they came from y = 0 or y = 1, and some y values would be impossible.

So I would find it more likely (and easier to implement) if you had normal noise. In this case, you could have something like:

...

parameters {
   ....
   real<lower=0> level_height;
   real<lower=0> sigma;
}
...


model {
....
for (i in 1:N)
{
  target += log_mix(Phi(eta[i]), 
                       normal_lpdf(y[i] | 0, sigma), 
                       normal_lpdf(y[i] | level_height, sigma)
  );
}
}

See the Finite Mixtures chapter of the Stan User’s Guide for more theory on mixture models in Stan.

Also note that the lower and upper bounds you use are somewhat suspicious - unless you have a physical reason to believe that, say, beta cannot be lower than 0.1 or larger than 10, it is usually more sensible to enforce soft constraints via the prior. Hard constraints should be reserved for things like ensuring a parameter is positive (in which case a lower bound of 0 is preferable).
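To make that concrete, here is a minimal sketch (not taken from your model) of replacing a hard interval with a positivity constraint plus a soft prior:

parameters {
  // Hard constraint only where it is physically required: a scale must be positive
  real<lower=0> beta;
}
model {
  // Soft constraint: values far above ~10 become implausible but are not forbidden
  beta ~ normal(0, 5);
}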

Best of luck with your model!

Dear Martin,

Thank you very much for your help! I think it would also be useful if I add a “generated quantities” block. I previously had some issues with priors where the samples didn’t fall within the predefined prior distribution (e.g. beta ~ normal(2,0.4), where the samples in all chains remained at around 4, despite removing hard constraints). Anyway, I will work on an improved version of the model given your useful suggestions!

Thanks again!

Best,
Thomas

This looks like there is a conflict between the prior and the data - this is definitely at least worth some investigation. In any case, putting a hard constraint in place to avoid prior-data conflict is usually not a good strategy: you are basically saying “whatever the data says, I will prefer my prior information”, in which case you might as well not use the data at all…

Does that make sense?

Dear Martin,

Thank you again for the comments. For clarity I’ve added my current R code and Stan code again. The issues with fitting somehow remain. You had a good point regarding “noise”, so I have now defined “noise” differently in the R code.

R code:

# Number of attempts
N <- 300

# Input parameters
alpha <- 10
beta <- 1.5
level <- 3

# Random normal noise
e <- 0.05
noise <- rnorm(N, 0, e)

# Simulation of region where level becomes activated
x_val <- alpha + (6 * beta * runif(N) - 3 * beta)

# Active or inactive level
y_act <- pnorm(x_val, alpha, beta) > runif(N)

# Add height of level
y_level <- y_act * level + noise

Some background information regarding the Stan code below: all parameters have a lower boundary of 0. For “beta” I have chosen to set it at 0.1 because of the definition of “eta”, to prevent the sampler from searching close to 1/0 (infinity); I had the impression that this worked better. Additionally, to prevent “noise” from converging to “level_height” (and thus running into parameter identifiability issues), I also put a hard upper constraint on “noise” based on the data (via “maxNoise”). The priors are rather informative and close to the simulated values, so I have the impression they are defined correctly.

Where can I improve my coding to adequately fit the parameters in Stan?

Thank you!

Stan code (x -> “x_val” & y -> “y_level”):

data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
transformed data {
  real maxNoise;
  maxNoise = 0.1 * max(y);
}
parameters {
  real<lower=0> alpha;
  real<lower=0.1> beta;
  real<lower=0> level_height;
  real<lower=0, upper=maxNoise> noise;
}
model {
  vector[N] eta;
  eta = 1 / beta * (x - alpha);
  alpha ~ normal(12, 2);
  beta ~ normal(2, 0.2);
  level_height ~ normal(4, 2);
  noise ~ normal(maxNoise / 2, maxNoise);
  for (i in 1:N) {
    target += log_mix(Phi(eta[i]),
                      normal_lpdf(y[i] | 0, noise),
                      normal_lpdf(y[i] | level_height, noise));
  }
}

Great, you are welcome :-)

Could you be more specific? Is this a problem with compiling the model? Or do you get divergences or other warnings when fitting? Or does fitting proceed without warnings, but the fitted values seem wrong?

I played with the code a bit and it seems the main problem is in initialization: by default, Stan initializes the parameters between -2 and 2 on the unconstrained scale, so for real<lower=0> x this turns out to be between exp(-2) and exp(2) (see the “Lower bounded scalar” section of the Stan Reference Manual for a bit more detail). This can, however, produce values that are in the extreme tails of your priors - especially for alpha, more than half of the inits will be more than 5 sd from the prior mean. This then causes numerical instabilities from which some chains never recover. When using inits centered on the priors for alpha and beta, e.g. init = function() { list(alpha = 10 + rnorm(1), beta = 2 + rnorm(1, sd = 0.2)) }, all of my chains behave nicely.
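To see this concretely, here is a small R check (just a simulation of the default init behaviour, not part of the model):

# Default inits are uniform(-2, 2) on the unconstrained scale,
# so for a <lower=0> parameter the initial value is exp(u).
u <- runif(1e5, -2, 2)
range(exp(u))              # roughly 0.14 to 7.4
# Fraction of inits more than 5 prior sds below the normal(12, 2) prior mean:
mean(exp(u) < 12 - 5 * 2)  # roughly 0.67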

This sounds a bit weird… I wouldn’t expect identifiability issues with large noise (but maybe I am wrong). I would first check if everything else in the model is correct.

Note that you can use triple backticks (```) to format code nicely. I took the liberty of editing your post for you.

Best of luck with the model!

Good point. Compiling goes fine, but as you said, either (1) fitting proceeds without warnings (Rhat = 1) but the fitted values are completely off (chains do not recover), or (2) fitting shows Rhat >> 1, due to a combination of chains that do and do not recover (I checked this with “traceplot”).

As you suggested, it may have to do with initializations from which the chains never recover. Therefore, I have implemented the initialization; however, I still get the same issues.
Compiling and fitting look like:
model1 <- stan_model(file = "basicmodel.stan")
fit <- sampling(model1, data = list(x = x_val, y = y_level, N = length(x_val)),
                chains = 4, iter = 1000, warmup = 500, init = init_fun)
with “init_fun”:
init_fun <- function() {
list(alpha = 10 + rnorm(1), beta = 2 + rnorm(1, sd = 0.2))
}

E.g. when fitting goes fine, I get values that are completely off. For example, with the prior in Stan:
alpha ~ normal(12,2);
and the initialization function using:
...alpha = 10 + rnorm(1)..
I still get strange values for “alpha”. Actually, I also do not want to inform R/Stan too much about the parameter values, because in reality these values are unknown. The “basicmodel.stan” file matches the posted version exactly. I don’t see where it goes wrong; maybe a higher/longer “warmup” and “iter”? Although it is not a very challenging model.

Thanks for your help!

Could you explain this statement in a little more detail?
With “alpha” in the stan code as
real<lower=0> alpha;
and
alpha ~ normal(12,2);
A prior mean at 12 and a lower boundary at 0 are far apart, but I had in mind that Stan uses the informative prior to adequately search the parameter space. Why is Stan looking at the tails?
I removed the lower boundaries in the code, but unfortunately the chain issues remain. Thanks for your feedback/explanation!

This is a common conceptual misunderstanding about Stan. Stan does much less magic than many users suspect :-) In fact, a Stan program just computes a function of the parameters (which should represent the log density = log likelihood + log prior) and (behind the scenes) its derivatives. So when you write alpha ~ normal(12, 2); Stan does not interpret this as “alpha should be normal with mean 12 and sd 2”; instead it is the same as target += normal_lupdf(alpha | 12, 2), i.e. just a value that is added to the target, which is the density being computed. No magic. This allows Stan to be fast and to implement many more models than it could if it needed to somehow “understand” the model.

So in particular Stan does not (and also cannot) “look” at your code and use the prior for initialization. This usually does not matter, as for well-behaved posteriors Stan quickly finds a good region from basically any initial position.

I just ran your model using your code in the latest CmdStan via cmdstanr (for stupid reasons, I currently don’t have a working rstan installation) and these are the results I get:

    variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
 lp__         19.61  19.94 1.41 1.22 16.94 21.26 1.00     1596     2742
 alpha        10.21  10.22 0.27 0.27  9.76 10.66 1.00     3149     2817
 beta          4.12   4.11 0.14 0.14  3.90  4.34 1.00     3413     2650
 level_height  3.00   3.00 0.00 0.00  3.00  3.01 1.00     5275     2887
 noise         0.05   0.05 0.00 0.00  0.05  0.06 1.00     3538     2427

So this looks like almost perfect recovery, except for beta. Are you getting the same results as me? It is also possible that some of the numerical instabilities you encountered were fixed in a later Stan version (rstan currently cannot use the latest Stan, for stupid reasons).

If I understand it correctly, there is a discrepancy between your model and the code you use to simulate the data. In your simulation you have:

y_act <-pnorm(x_val,alpha,beta)>runif(N);
y_level <-y_act*level + noise

So the larger x, the higher the chance of y_act being true, and thus the higher the chance of observing the level value.

But in the Stan model you have:

eta = 1/beta * (x - alpha);
....
 target += log_mix(Phi(eta[i]), 
                       normal_lpdf(y[i] | 0, noise), 
                       normal_lpdf(y[i] | level_height, noise)

So the larger x, the larger eta, and the larger the chance of selecting the first component, which is the lower one, with mean 0. If I flip the clauses, e.g.:

 target += log_mix(Phi(eta[i]), 
                       normal_lpdf(y[i] | level_height, noise),
                       normal_lpdf(y[i] | 0, noise)

The results are:

     variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
 lp__         366.49 366.81 1.39 1.20 363.74 368.11 1.00     1893     2606
 alpha         10.15  10.15 0.18 0.18   9.85  10.44 1.00     2842     2283
 beta           1.86   1.86 0.14 0.14   1.64   2.10 1.00     3164     2369
 level_height   3.00   3.00 0.00 0.00   2.99   3.00 1.00     5056     2760
 noise          0.05   0.05 0.00 0.00   0.05   0.05 1.00     3480     2596

So beta is now recovered much better. The difference is probably only because your prior actually excludes 1.5 quite heavily (a priori there is only around 0.6% probability of beta <= 1.5), so you get the expected push of the posterior towards the prior.
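You can check that figure directly in R:

# Prior probability that beta <= 1.5 under the normal(2, 0.2) prior
pnorm(1.5, mean = 2, sd = 0.2)  # ~0.0062, i.e. roughly 0.6%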

For your worry about priors: if the model is sensible and you have reasonable amount of data, the model will very likely work well even with quite broad (e.g. “weakly informative”) priors.

Best of luck with your model!


Exactly, this is what I get! There were only some issues with “beta”, but in the meantime those have been resolved (see below).

I completely overlooked this discrepancy. I thought “true” belonged to the second element, but fantastic. I’ve changed it and now I get the same results as you!

As mentioned at the start, I’m just a beginner with Stan. The basic model now runs with a single level, but it should also work with multiple active levels (and heights). The next step also includes adding a “generated quantities” block to verify the output of the fitted model. I will work on it further.
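A first sketch of what I have in mind for that block (assuming the corrected single-level model above, where a larger eta means the level is active):

generated quantities {
  // Posterior predictive draws of y, mirroring the mixture in the model block
  vector[N] y_rep;
  for (i in 1:N) {
    real eta = 1 / beta * (x[i] - alpha);
    int active = bernoulli_rng(Phi(eta));
    y_rep[i] = normal_rng(active == 1 ? level_height : 0, noise);
  }
}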

Thanks a lot for your help and patience.


Dear Martin,

I have been working on the next step, where, as an extension of the previous basic model, the input parameters (alpha, beta, level) are not of length 1 but of length 2 (for now; they may also be longer).

The corresponding R code:

# Number of attempts
N <- 400

# Input parameters
alpha <- c(10, 14)
beta <- c(1.5, 3)
level <- c(3, 1)

# Length of alpha, beta, and level_heights
# K = 2 (in this case, but can also be larger)
K <- length(alpha)

# Random normal noise
e <- 0.05
noise <- rnorm(N, 0, e)

# Simulation of region where levels become activated
x_val <- mean(alpha) + (6 * max(beta) * runif(N) - 3 * max(beta))

# Active or inactive level
y_act_new <- matrix(nrow = K, ncol = N)
for (i in 1:K) {
  y_act_new[i, ] <- as.numeric(pnorm(x_val, alpha[i], beta[i]) > runif(N))
}

# Add height of level
y_level_new <- colSums(y_act_new * level) + noise

# Output plot where the number of potentially observable levels becomes 2^K
plot(x_val, y_level_new)

# Generate initial values as input for the Stan model
init_fun <- function() {
  list(alpha = mean(alpha) + rnorm(2, sd = 0.2), beta = 2 + rnorm(2, sd = 0.2))
}



The parameter set (alpha, beta, level) now doubles (6 parameters to be estimated) plus the single noise parameter (= 7 in total); for K = 3 it would be 10. However, the number of potentially observable levels (“mixtures”) at the output becomes 2^K, so if K increases, the number of observable levels grows exponentially. I want to write the Stan code such that the model can estimate this parameter set for varying K (to keep it simple, K = 2 for now).
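To make the combinatorics concrete, here is a small R check (separate from the model) that enumerates all 2^K level sums for K = 2:

# All 2^K on/off patterns of the levels and the sums they produce
level <- c(3, 1)
combos <- as.matrix(expand.grid(rep(list(0:1), length(level))))
drop(combos %*% level)  # 0, 3, 1, 4 -> the 2^2 = 4 observable means (plus noise)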

My starting Stan code (as an extension of the basic model, x -> “x_val”, y -> “y_level_new”):

data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
transformed data {
  int K;
  K = 2; // length of alpha, beta, and level
}
parameters {
  vector<lower=0>[K] alpha;
  vector<lower=0.1>[K] beta;
  vector<lower=0>[K] level_height;
  real<lower=0> noise;
}
model {
  vector[N] eta[K];
  //eta = 1/beta * (x - alpha);
  alpha ~ normal(12, 2);       // Hyper-priors
  beta ~ normal(2, 1);         // Hyper-priors
  level_height ~ normal(4, 2); // Hyper-priors
  noise ~ normal(0, 0.1);
  for (i in 1:N) {
    for (n in 1:K) {
      real lp[K];
      eta[n] = 1 / beta[n] * (x - alpha[n]);
      lp[n] = log_mix(Phi(eta[i, n]),
                      normal_lpdf(y[i] | level_height[n], noise),
                      normal_lpdf(y[i] | 0, noise));
      target += log_sum_exp(lp);
    }
  }
}
I’m stuck with the code provided above (it still gives an error at compilation, but I thought I’d share it anyway). “y[i]” can now also be a combination of “level_height”s, so for K = 2: 0, level_height[1], level_height[2], level_height[1] + level_height[2] -> 2^2 = 4 possibilities. However, if K = 3, there will be 2^3 (= 8) combinations. So I’m stuck on how to code this flexibly in Stan with “log_mix”, or maybe something other than “log_mix” is required to adequately code the extended model.

Thanks again for the input!

Hi,
so unless you have a lot of data, the model would IMHO be pretty hard to fit for larger K. So I don’t think it’s that terrible to just hardcode all the options you need, since K >= 5 would already require a lot of data.

However, there are definitely ways to code the general case. The main idea is that you can use bitmasks to iterate over the power set (the set of all subsets), as described e.g. in Common Bitwise Techniques for Subset Iterations. I didn’t check that this works, just that it compiles, but hopefully it will point you in the right direction. Feel free to ask for clarifications. I assumed that:

  • for each possible combination the underlying levels combine without any noise and the noise is added only after adding all the level means together
  • the probability of each level being added is independent of other levels being added
functions {
  int power_of_two(int x) {
    int ret = 1;
    for(i in 1:x) {
      ret = ret * 2;
    }
    return ret;
  }
}

data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}

transformed data {
   int K = 2; // length of alpha, beta, and level
   int N_combinations = power_of_two(K);
}

parameters {
  vector<lower=0>[K] alpha;
  vector<lower=0.1>[K] beta;
  positive_ordered[K] level_height;
  real<lower=0> noise;
}

model {
  //eta = 1/beta * (x - alpha);
  alpha ~ normal(12,2);       // Hyper-priors
  beta ~ normal(2,1);         // Hyper-priors
  level_height ~ normal(4,2); // Hyper-priors
  noise ~ normal(0,0.1);


  for (i in 1:N) {
    real lp[N_combinations];
    //Iterate over all 2^K combinations
    for  (c in 0:(N_combinations - 1))
    {
      real combination_lp = 0; // Log-probability of the combination based on eta
      real combination_mean = 0; // Mean associated with the combination
      int remainder = c;
      for(k in 1:K) {
        real eta = 1/beta[k] * (x[i] - alpha[k]);
        int last_bit = remainder % 2; //Modulo division
        // Integer division, requires Stan > 2.24
        // In older versions you can use (with a warning)
        // remainder = remainder / 2;
        remainder = remainder %/% 2;
        if(last_bit) {
          // k-th level is included
          combination_mean = combination_mean + level_height[k];
          // Note that normal_lcdf(x | 0, 1) == log(Phi(x))
          combination_lp = combination_lp + normal_lcdf(eta | 0, 1);
        } else {
          // k-th level is excluded
          // Note that we use _lccdf (one more "c")
          // normal_lccdf(x | 0, 1) == log(1 - Phi(x))
          combination_lp = combination_lp + normal_lccdf(eta | 0, 1);
        }
      }
      lp[c + 1] = combination_lp + normal_lpdf(y[i] | combination_mean, noise); // Stan arrays are 1-based, hence c + 1
    }
    target+=log_sum_exp(lp);
  }
}

Note that we enforce an ordering on the levels via positive_ordered; otherwise the model would be multimodal (swapping all the parameters between two components produces exactly the same likelihood).

Hope this clarifies more than confuses.

Hi,

Thank you very much for your quick response and providing a template for the code. I will look into it! K can sometimes be much larger than 5, so generally the data is largely undersampled compared to the number of combinations (2^k). Therefore, also not all levels may be observable in the data. So for large K (> 5) going over all combinations will take a lot of time and becomes indeed hard, but I don’t know whether going over all combinations could be avoided and using e.g. the general trend of the curve instead. To avoid overfitting (too large K), may then be mitigated by using a cost function for the number of implemented parameters. Your code is a useful starting point. I will also go through your suggested topics in more detail! Thanks a lot!