Modeling mixture of two gamma knowing that the one with the highest value is the output


In my toy example there are two processes that have a gamma distribution (with the same scale 1), and in every trial the one with the largest outcome is “selected”. In R, the process would look like this:

N <- 1000
alpha1 <-4
alpha2 <-3
y1 <- rgamma(N,alpha1)
y2 <- rgamma(N,alpha2)

And my data would look like this:

data <- list(N=N, y = pmax(y1, y2))

I want to estimate alpha1 and alpha2.

My logic was the following, the proportion of time that y1 was selected is:

mean((y1/y2) >1 )
# 0.642

But this can also be estimated analytically with the Beta prime distribution:

# 0.65625

So I’m trying to use the cdf of the beta prime distribution (betapr_cdf in my model) to estimate the mixing proportions, and my idea is that when gamma1 is selected then gamma2 is “censored”, it had a smaller value and then I use the cdf of the gamma, and the other way around when gamma1 is selected.
I’m writing because obviously, it didn’t work!

(I’m starting with super strong priors, but it doesn’t work anyways).

functions {
  real betapr_cdf(real x, real alpha, real beta){
    return inc_beta(alpha, beta,x/(1+x));
data {
  int N;
  real<lower=0> y[N];
parameters {
  real<lower=0> alpha1;
  real<lower=0> alpha2;
transformed parameters{
  // Probability of selecting P1:
  real <lower= 0, upper=1> P1 = betapr_cdf(1,alpha2, alpha1);
model {
  alpha1~ normal(4,.1);
  alpha2 ~ normal(3,.1);
  for (n in 1:N){
    target += log_mix(P1,  gamma_lpdf(y[n]|alpha1,1) + gamma_lcdf(y[n]|alpha2,1),
                           gamma_lpdf(y[n]|alpha2,1) + gamma_lcdf(y[n]|alpha1,1)); 

The problem is that it doesn’t even start:

out <- stan("rate.stan", data=data, chains =1,iter=500)

Chain 1: Rejecting initial value:
Chain 1:   Gradient evaluated at the initial value is not finite.
Chain 1:   Stan can't start sampling from this initial value.
Chain 1: 
Chain 1: Initialization between (-2, 2) failed after 100 attempts. 
Chain 1:  Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
[1] "Error in sampler$call_sampler(args_list[[i]]) : Initialization failed."
error occurred during calling the sampler; sampling not done

And when I print different parts of the model I noticed that P1 is stuck at 0 in the initialization, but it shouldn’t be because alpha1 and alpha2 have sensible values in the same iteration!

So I’m stuck!
(I’ve already checked that betapr_cdf is working fine…)

(By the way, I’m interested in understanding why this model fails, but if someone has an alternative implementation I’ll be happy to hear. I don’t care so much if the processes are gamma, they just have to be distributed with some skewed distribution with positive support: lognormal, Frechet, Weibull, all these would work).

Just replace the P1-related line with

  real <lower= 0, upper=1> P1 = betapr_cdf(1.0, alpha2, alpha1);

A matter of int and real compatibility.

Now, I don’t quite understand the lcdf terms.

Let Z be a random variable that takes the value 1 if Y1 > Y2 and 0 otherwise.Then it seems to me that
Pr(Y = y) = Pr(Y = y \mid Z = 1)Pr(Z =1) + Pr(Y = y \mid Z = 0)Pr(Z = 0) ,
which means
Pr(Y = y) = \text{logmix} (p_1, \text{gamma}( y | \alpha_1, 1), \text{gamma}( y | \alpha_2, 1) ),
where p_1 := Pr(Z = 1).
So the model block should be simply

  for (n in 1:N){
    target += log_mix(P1,  gamma_lpdf(y[n]|alpha1, 1),
                           gamma_lpdf(y[n]|alpha2, 1));

This gives

> out
Inference for Stan model: gamma_max.
1 chains, each with iter=500; warmup=250; thin=1; 
post-warmup draws per chain=250, total post-warmup draws=250.

           mean se_mean   sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
alpha1     4.72    0.01 0.07     4.60     4.68     4.72     4.77     4.84   164 1.00
alpha2     3.28    0.01 0.11     3.07     3.21     3.28     3.36     3.48   153 1.00
P1         0.70    0.00 0.02     0.67     0.69     0.70     0.71     0.74   146 1.00
lp__   -2106.50    0.11 1.12 -2109.22 -2106.90 -2106.15 -2105.73 -2105.49   110 1.02

Samples were drawn using NUTS(diag_e) at Tue Jun 25 13:29:07 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

PS. please ignore the notational subtleties. Pr(Y = y) = 0 because the distributions are continuous, but you catch my drift.

1 Like

Thanks for taking a look at it!

But really?? :( But my function is defined only for real numbers in x! I would assume that even if 1 is an integer it will get transformed to 1.0 for the function.

I’m obviously wrong, because without it, it works, and not with it. But my logic was the following, in a mixture, in general, you just get that one of the components was selected and not the other one. I wanted to include the information, that one when 1 component was selected the other one produced a value that was smaller than the one selected. Something on the lines of integrating censored data in the manual..

In any case, thanks for the help!!, I was stuck for hours with this silly mistake!

By the way, do you happen to know, if lognormal, Weibull, Frechet, or truncated normal have a ratio that is also a distribution with a nice cdf ?
Difference, instead of ratio, would also help, I could use evaluate when the difference is positive or negative, with a cdf of 0.


And you’d be wrong. Stan does not do int to real promotion by default. At any rate, the fix is easy and works, right?

Well, it kinda works. I’ve been playing round with this and it seems the model is not identifiable. Try a few values of \alpha_1 and \alpha_2 and let me know.

The ratio of two log-normals is also a log-normal, so that potentially has a nice pdf. Would have to think about the others.

what is the relationship between the parameters of the new “ratio” lognormal and the original two?

If Y_1 \sim \text{lognormal}(\mu_1, \sigma_2^2) and Y_2 \sim \text{lognormal}(\mu_2, \sigma_2^2) with Y_1 and Y_2 independent, then Z = Y_1/Y_2 \sim \text{lognormal}(\mu_1-\mu_2, \sigma_1^2 + \sigma_2^2) . See this.



\pi(x_{1}, x_{2} \mid x_{1} > x_{2}) = \pi_{1} (x_{1}) \, \pi_{2}(x_{2}) \, \mathbb{I} [ x_{1} > x_{2} ]

so that

\begin{align*} \pi(x_{1} \mid x_{1} > x_{2}) &= \int_{0}^{\infty} \mathrm{d} x_{2} \, \pi(x_{1}, x_{2} \mid x_{1} > x_{2}) \\ &= \int_{0}^{\infty} \mathrm{d} x_{2} \, \pi_{1} (x_{1}) \pi_{2}(x_{2}) \mathbb{I} [ x_{1} > x_{2} ] \\ &= \pi_{1} (x_{1}) \int_{0}^{\infty} \mathrm{d} x_{2} \, \pi_{2}(x_{2}) \mathbb{I} [ x_{1} > x_{2} ] \\ &= \pi_{1} (x_{1}) \int_{0}^{x_{1}} \mathrm{d} x_{2} \, \pi_{2}(x_{2}) \\ &= \pi_{1} (x_{1}) \, \Pi_{2}(x_{1}) \end{align*}

where \pi denotes a probability density function and \Pi denotes a cumulative distribution function.


You mean \pi_1(x_1)\Pi_2(x_1), right?

Yeah – typo fixed. Thanks.

Sorry my utter ignorance, but hint for what?

yeah, it’s not. And it’s not better with the lognormal model. P_1 jumps very quickly to 0 or to 1…
Would it be possible to put a beta prior on P_1? (i.e., Is there some jacobian that would allow me to do this?)
(I know I could constraint the alphas, but in the real model, alpha_1 and alpha_2 are derived from different regressions)

Ok, here I go again. This represents more my actual goal (it’s still a toy model, though). I manage to fit it, it converges, but it gives me estimates that are too accurate and wrong. (I hope it’s still ok, that I post under a misleading title, should I make a new thread? change the title?)

I want to model two processes, one behaves according to the practice law (_to in the model, like time out), and the other one depends on a predictor (_proc). It’s a race, and the one that finishes first is the one observed:

N <- 5000
X <- rnorm(N)
Int_to <- log(1000) ##log(1000) = 6.9
Int_proc <- 6
r <- .0005
#effect of something:
mu_proc <- Int_proc + X *.1
mu_to <- Int_to + -r*(1:N)
sigma <- .5
proc <- rlnorm(N,mu_proc,sigma)
to <- rlnorm(N,mu_to,sigma)
data <- tibble(trial=1:N,X, proc=proc, to=to, RT = pmin(proc, to))
mean(proc < to)
plnorm(1, proc - to, sqrt(2) * sigma) %>% mean
# It will look like this:
# crosses are the observed values, proc, and to 
# are the finishing times of the  underlying processes
data %>% gather(key="type",value="T", proc, to) %>%
    ggplot( aes(x=trial, y=T, color = type)) + 
   geom_point(alpha=.5) +
    geom_point(aes(y=RT), size=2, color= "black", alpha=.1, shape =3)

This is the actual model:

functions {
  real lognormal_approx_cdf(real x, real mu, real sigma){
    real x_n = (log(x)-mu)/sigma;
    return inv_logit(0.07056 * x_n^3 + 1.5976* x_n);
data {
  int N;
  vector<lower=0>[N] RT;
  vector[N] X;
  vector[N] trial;
parameters {
 real<lower=0> sigma;
  real Int_proc;
  real Int_to;
  real slope;
  real<lower=0> r;
transformed parameters{
  vector[N] mu_proc= Int_proc + X* slope;
  vector[N] mu_to = Int_to  - r* (trial/1000);
  vector[N] P_proc;
    //  given:
    // X ~ lognormal(mu_proc, sigma)
    // Y ~ lognormal(mu_to, sigma)
    // then:
    // X/Y ~ lognormal(mu_proc - mu_to, sqrt(2.0)* sigma)
    // X/Y < 1 => X was faster and was selected
    // P(X < Y) = lognormal_cdf(1, mu_proc - mu_to, sqrt(2.0)* sigma)
  for(n in 1:N)
    // P_proc[n] = lognormal_approx_cdf(1.0,  mu_proc[n] - mu_to[n], sqrt(2.0)*sigma);
    P_proc[n] = lognormal_cdf(1.0,  mu_proc[n] - mu_to[n], sqrt(2.0)*sigma);
model {
  Int_proc~ normal(6,1);
  Int_to ~ normal(log(1000),3);
  slope ~ normal(.1,.3);
  r ~ normal(.5,.5);
  sigma ~ normal(.5,.5);
  for (n in 1:N){
    target += log_mix(P_proc[n],  lognormal_lpdf(RT[n]|mu_proc[n],sigma),

(There is code of an approximation of the CDF in functions, but it doesn’t make any difference).

data_simple <- as.list(data) %>% c(N=nrow(data))
out_simple <- stan("rate_l2.stan", data=data_simple, chains =2,iter=1000)
print(out_simple, pars= c("Int_proc","Int_to","slope", "r", "sigma"))
## Inference for Stan model: rate_l2.
## 2 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=1000.

## mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## Int_proc 5.83       0 0.01 5.80 5.82 5.82 5.83  5.85   789    1
## Int_to   6.40       0 0.03 6.35 6.39 6.40 6.42  6.45   380    1
## slope    0.11       0 0.01 0.08 0.10 0.11 0.12  0.13   949    1
## r        0.40       0 0.01 0.38 0.39 0.40 0.40  0.41   384    1
## sigma    0.43       0 0.00 0.42 0.43 0.43 0.43  0.44   998    1

## Samples were drawn using NUTS(diag_e) at Wed Jun 26 14:50:47 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

As you see, I get very accurate wrong answers, so I’m mispecifying something in the model, but I’m not sure what…

For the generative model. This is the density for the observations when the first component of two is the largest. There’s no need to model any mixture.

But shouldn’t I know x_1 > x_2 to do that? I don’t know that, I don’t know which one was selected.

In that case you’ll have to marginalize out between both possibilities, which brings you into the world of order statistics. The distribution of the maximum can be worked out analytically in terms of cumulative distribution functions, see for example If you try to fit the joint, unmarginalized model then you’ll probably run into many degeneracies related to the fact that you don’t know which of the components you’ll actually observe.

1 Like

Which is precisely what we observe with the formulation I gave. I should’ve seen that one coming…

Imposing an ordering on the parameters might help, but then you have to assume that you observe the same component across all observations. If which component “wins” varies then models for the censored observations will be pretty degenerate unless you can do the marginalization analytically.

Yes, I don’t know which one wins; look at the description of the last toy model. The thing is that the components are based on different regressions and at the beginning one almost always wins and towards the the other almost wins. So I have some useful info about the components. The weird thing is that the last model converges, no errors or anything, but it gives me estimates that are slightly off and too precise. So there must be something wrong in the model.

What does it mean “degenerate”? Not identifiable? Or the fact that it gives me bad estimates. Is there a way to make it not degenerate?

And going back to the right likelihood, shouldn’t I add the cdfs in the mixture besides the pdfs?

UPDATE: I’ve run it with many chains now, the model has a lot of different precise modes :(

I’ve created a repo showing exactly what I want to do and where I get stuck: