Binomial function and population model providing "real" population size

Here is a toy version of a population model:

    ### STAN CODE - simplePopMod.stan ######################
    data {
      int popInit;
      real propYInit;
      real preg;
      real Ma;
      real Mj;
      real sigma;
      int years;
      // real ObsA[years];
    }

    parameters {
      real<lower=0, upper=1> Mjuv;
      // real pObs;
    }

    transformed parameters {
      vector<lower=0>[2] pop[years];
      real propY[years];

      pop[1,1] = popInit * propYInit;
      pop[1,2] = popInit * (1 - propYInit);

      for (year in 1:(years-1)){
        pop[year+1,1] = pop[year,2] * preg * (1 - Mjuv) ;
        pop[year+1,2] = (pop[year,1] + pop[year,2]) * (1 - Ma);
      }
    }

    model {
      Mjuv ~ normal(Mj, sigma);
      
      
      // pObs ~ uniform(0,1); 
      // for (year in 1:years){
      //   ObsA[year] ~ binomial(pop[year,2], pObs);
      // }
      
    }
    ### R CODE #################################################

    library(rstan)
    fitmodel = "simplePopMod.stan"

    stan.data <- list(years = 100, popInit = 100, propYInit = 0.3, preg = 0.4, Ma = 0.05, Mj = 0.6, sigma = 0.05)#, ObsA = ObsA)


    init_fun <- function() {list(
      Mjuv = runif(1, 0.5, 0.9)
    )}

    params <- c("Mjuv", "pop")

    nsamples <- 1000
    nburnin <- 1000
    ncore <- 1

    out <- stan(
      file = fitmodel,         
      data = stan.data,        
      pars = params,           
      init= init_fun,                      
      chains = ncore,          
      warmup = nburnin,        
      iter = nburnin+nsamples, 
      cores = ncore,           
      refresh = 100
    )

Now imagine that for each year you have observation data on the number of adults (ObsA). You know that you only have a certain probability (pObs) of observing the truth (pop[year, 2]), and you know that the relationship between the real number and the observed number is the result of a binomial process.
So I think I can write the following:

    pObs ~ uniform(0,1); 
    for (year in 1:years){
      ObsA[year] ~ binomial(pop[year,2], pObs);
    }

However, in Stan, pop[year,2] is a “real” and the binomial function only accepts integers (and you cannot convert a real into an integer using the round function).

So if someone can provide a way to fit such a relationship in Stan, that would be great!

@jonah here is the toy population model you were asking for! :-)

Great, thanks for sharing this. I’m not sure what the best solution is here but hopefully having this example will get more eyes on this. (If nobody responds, definitely try posting it again after a while.)

Just to have breadcrumbs, I believe this sprang out from the discussion at Using beta binomial with size of the population provided as a "real" (where there are some links to papers proposing something related, which I unfortunately don’t understand).

One of the links follows through to Continuous binomial, negative binomial and Poisson where Mike warns that a proposed extension of the Binomial distribution is impractical.

I would however definitely try to reimplement the binomial distribution with real N directly. It might break badly, but it is IMHO worth trying, e.g. (a sketch, I didn’t even test if it compiles; it could also be vectorized to take arrays as input instead):

    functions {
      real binomial_real_dangerous_lpmf(int k, real n, real p) {
        return binomial_coefficient_log(n, k) + k * log(p) + (n - k) * log1m(p);
      }
    }

If that lets you reliably recover the parameters from the simulation (and even better, if you can use SBC as a stronger check), I wouldn’t be too afraid to actually use it. For large-ish n, the difference between binomial_real_dangerous_lpmf(k, n, p) and binomial_lpmf(k, floor(n), p) should be pretty small.

A more difficult challenge would be to somehow ensure that ObsA[year] < pop[year,2] for all Mjuv values to make the likelihood well defined.

@martinmodrak
Already too complex for me, I am afraid!
I don’t know how to ensure ObsA[year] < pop[year,2] for all Mjuv values.

Then, while I totally believe you can eventually make some progress on this task and the community here will be happy to support you on the way, you should seriously ask yourself how much head-banging against the wall/computer you are willing to put into this :-) I have a creeping feeling (though I am far from certain) that what you aim for (the bigger model, not the toy example here) would be a neat small research project in applied statistics by itself :-)

The first step on this way would be to just put the function I wrote into your model and use it as:

    ObsA[year] ~ binomial_real_dangerous(pop[year,2], pObs);

I expect it to fail gloriously with a lot of error messages from the fitting process, but it also might somewhat work.
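
To make that concrete, here is a minimal sketch of the additions to the toy model (untested, just my guess at how the pieces would fit together: ObsA declared as integer data, pObs as a parameter, and the custom lpmf used in the model block):

    data {
      // ... existing data as in the toy model ...
      int ObsA[years];   // observed number of adults each year (must be integer)
    }
    parameters {
      real<lower=0, upper=1> Mjuv;
      real<lower=0, upper=1> pObs;   // detection probability
    }
    model {
      Mjuv ~ normal(Mj, sigma);
      pObs ~ uniform(0, 1);
      for (year in 1:years) {
        ObsA[year] ~ binomial_real_dangerous(pop[year, 2], pObs);
      }
    }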

EDIT:

Yeah, that looks like some non-trivial math, and I don’t see an immediate solution, but if you are in for a ride, you may try to learn the Wolfram language (they have free online notebook environments) and see if Mathematica can solve this inequality for you (I wouldn’t bet on it, though…). The discreteness of the model looks like additional trouble. Maybe there is a similar continuous version that would be more amenable to analysis, but I wouldn’t bet on it either.

Alternatively, you may just give up on the binomial assumption, assume the population model just gives you some sort of “mean” of the population and treat the response as negative binomial with the given mean.

Even before I start thinking of trying (!) something more complex … I already have a problem just running the code you provided …

    functions {
      real binomial_real_dangerous_lpmf(int k, real n, real p) {
        return binomial_coefficient_log(n, k) + k * log(p) + (n - k) * log1m(p);
      }
    }

I have copy/pasted this into my previous Stan code after the transformed parameters section, and it seems the parser does not like it.

    PARSER FAILED TO PARSE INPUT COMPLETELY
    STOPPED AT LINE 31:
    functions {
      real binomial_real_dangerous_lpmf(int k, real n, real p) {
        return binomial_coefficient_log(n, k) + k * log(p) + (n - k) * log1m(p);
      }
    }

I have never used the “functions” block before, but is there a specific order in which it should appear?

The functions block goes first - check out https://mc-stan.org/docs/2_23/reference-manual/overview-of-stans-program-blocks.html and https://mc-stan.org/docs/2_23/reference-manual/functions-chapter.html
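
For reference, a minimal sketch of the overall block order, assuming the toy model above (block bodies elided):

    functions {
      real binomial_real_dangerous_lpmf(int k, real n, real p) {
        return binomial_coefficient_log(n, k) + k * log(p) + (n - k) * log1m(p);
      }
    }
    data {
      // ... as before ...
    }
    parameters {
      // ... as before ...
    }
    transformed parameters {
      // ... as before ...
    }
    model {
      // ... as before ...
    }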

@martinmodrak

Thanks … it compiles.
Note that it is suggested to use lchoose instead of binomial_coefficient_log, which is deprecated.

However, how to be sure that everything goes as it should?

I have done the following to test it:

  • I ran the model with a quasi-fixed value for Mjuv (i.e. setting sigma to 0.00005) without the fitting part (i.e. as written in my original post) and extracted the pop[,2] vector.
  • Then I used the model with the fitting ability (i.e. including ObsA[year] ~ binomial_real_dangerous(pop[year,2], pObs)) and provided the values I had extracted before (i.e. the pop[,2] vector) as ObsA.
  • …and I was expecting the model to converge to pObs = 1.

And it seems to be the case: it goes through all the iterations without a warning and the output shows Mjuv = 0.60 and pObs = 1 (but the final n_eff = 1, which seems odd).

… however, if I provide ObsA as 0.7 * the pop[,2] vector previously extracted … expecting the model to find pObs = 0.7 … then Stan throws the following message: "Stan model 'simplePopMod' does not contain samples." (Note: this is true for all values other than ObsA * 1 and ObsA * 0.)

Run with cores = 1; this will (almost always) show more detailed error messages.

@martinmodrak
In fact my problem came simply from the fact that I was providing non-integer values for ObsA.

So it works (without any error!) … however … this is clearly not efficient, because with 10000 burn-in and 10000 samples I only obtain an n_eff of ~ 1600.

Moreover, as you mentioned, if one ObsA[year] is larger than the corresponding pop[year,2], Stan gives the following:

    Chain 1: Rejecting initial value:
    Chain 1:   Log probability evaluates to log(0), i.e. negative infinity.
    Chain 1:   Stan can't start sampling from this initial value.

So maybe you could just do a “dirty” solution and have:

    ObsA[year] ~ neg_binomial_2(pop[year,2] * pObs, phi);

where phi is an additional positive parameter? This would allow ObsA > pop[year,2] while keeping the same mean…
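
A minimal sketch of the additions this would need (the exponential prior on phi is just a placeholder assumption, not a recommendation):

    parameters {
      real<lower=0, upper=1> Mjuv;
      real<lower=0, upper=1> pObs;
      real<lower=0> phi;   // overdispersion of the negative binomial
    }
    model {
      Mjuv ~ normal(Mj, sigma);
      pObs ~ uniform(0, 1);
      phi ~ exponential(1);   // placeholder prior, adjust to your problem
      for (year in 1:years) {
        ObsA[year] ~ neg_binomial_2(pop[year, 2] * pObs, phi);
      }
    }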

Another dirty solution would be to keep the binomial likelihood and compute a new “total population” pop_prime[year] as a transformation of pop[year,2] so that

  • pop_prime[year] > ObsA[year]
  • for pop[year,2] much larger than ObsA[year], pop_prime[year] is close to pop[year,2]

Such a transform could be achieved using the log1p_exp function (a.k.a. Softplus), i.e.:

    pop_prime[year] = ObsA[year] + log1p_exp(pop[year,2] - ObsA[year]);

In both cases, you need to extensively validate with simulated data whether this doesn’t introduce some bias into the final inference.
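
For concreteness, a sketch of how the second option could be wired into the toy model, assuming binomial_real_dangerous_lpmf is defined in the functions block and ObsA is integer data (untested):

    transformed parameters {
      vector<lower=0>[2] pop[years];
      real pop_prime[years];
      // ... existing code filling in pop ...
      for (year in 1:years) {
        // softplus keeps pop_prime strictly above ObsA[year], but close to
        // pop[year,2] whenever pop[year,2] is much larger than ObsA[year]
        pop_prime[year] = ObsA[year] + log1p_exp(pop[year,2] - ObsA[year]);
      }
    }
    model {
      Mjuv ~ normal(Mj, sigma);
      pObs ~ uniform(0, 1);
      for (year in 1:years) {
        ObsA[year] ~ binomial_real_dangerous(pop_prime[year], pObs);
      }
    }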

@martinmodrak Thanks for your help! It was greatly appreciated.
I think the “binomial_real_dangerous” approach is the closest thing to a solution that I can come up with so far. While not stable, it works, at least for one part of my bigger model.
So I will consider it as a solution to my question.
However, if anyone has an idea to improve this, I will consider it.
