Unable to recover mixture between Wiener-diffusion and uniform distribution

fitting-issues

#1

A common problem when modeling response time data with the wiener-diffusion model are contaminant response times. Especially fast outliers (sometimes called anticipatory responses) can essentially completely determine the non-decision time parameter (\tau). A common approach for dealing with this problem is assuming that the data comes from a mixture between the diffusion process and a uniform distribution bounded at the minimum and maximum observed response time (e.g., Ratcliff & Tuerlinkcx, 2002). It is usually assumed that the vast majority of trials come from the Wiener process (e.g., over 95%) and only a small faction comes from the uniform distribution.

I have generated such data and tried to recover the parameters in Stan. This works for the very simple case of having only one condition (i.e., intercept only for all parameters of the wiener model) after some tinkering with the priors, but not for the case with two conditions which differ in their drift rate. Not working here means that all post-warm up samples are divergent and the chains appear to be stuck.

I first generate the data in R.

library("rtdists")
set.seed(555)
rts <- rbind(
  cbind(rdiffusion(500, a=1, v=2, t0=0.5), condition = "A"),
  cbind(rdiffusion(500, a=1, v=-1.5, t0=0.5), condition = "B"))
rts$response2 <- as.numeric(rts$response)-1
prop_outlier <- 0.02
n_out <- round(prop_outlier*nrow(rts))
rts[sample(nrow(rts), n_out),"rt"] <- runif(n_out, 0.3, 1.5)
rts[sample(nrow(rts), n_out),"response2"] <- sample(0:1, size = n_out, replace = TRUE)
min(rts$rt) # [1] 0.3520957

It is straight forward to fit the data with the Wiener model alone using brms. However, thanks to the contaminants, the parameters do not recover very well (without the contaminants, the recovery would be basically perfect).

library("brms")
formula <- bf(rt | dec(response2) ~ 0+condition, 
              bs ~ 1, ndt ~ 1, bias ~ 1)
prior <- c(
  prior("normal(0, 2)", class = "b"),
  set_prior("normal(1, 1)", class = "Intercept", dpar = "bs"),
  set_prior("normal(0.2, 0.2)", class = "Intercept", dpar = "ndt"),
  set_prior("normal(0.5, 0.1)", class = "Intercept", dpar = "bias")
)
tmp_dat <- make_standata(formula, 
                         family = wiener(link_bs = "identity", 
                                         link_ndt = "identity",
                                         link_bias = "identity"),
                         data = rts, prior = prior)
initfun <- function() {
  list(
    b = rnorm(tmp_dat$K),
    temp_bs_Intercept = runif(tmp_dat$K_bs, 1, 2),
    temp_ndt_Intercept = runif(tmp_dat$K_ndt, 0.1, 0.2),
    temp_bias_Intercept = rnorm(tmp_dat$K_bias, 0.5, 0.1),
    lambda = runif(1, 0, 0.05)
  )
}
fit_orig <- brm(formula, 
                data = rts,
                family = wiener(link_bs = "identity", 
                                link_ndt = "identity",
                                link_bias = "identity"),
                prior = prior, inits = initfun,
                iter = 1000, warmup = 500, 
                chains = 4, cores = 4, 
                control = list(max_treedepth = 15))
summary(fit_orig)
# Population-Level Effects: 
#                Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# bs_Intercept       1.53      0.03     1.48     1.58       1225 1.00
# ndt_Intercept      0.34      0.00     0.34     0.34       2000 1.00
# bias_Intercept     0.45      0.01     0.43     0.48       1024 1.00
# conditionA         1.84      0.10     1.64     2.04        960 1.00
# conditionB        -1.01      0.09    -1.19    -0.83       1138 1.00

Next, I adapt the brms code by changing the likelihood to a mixture. So the likelihood becomes:

for (n in 1:N) {
  target += log_mix(lambda, 
      uniform_lpdf(Y[n] | min_Y, max_Y),
      wiener_diffusion_lpdf(Y[n] | dec[n], bs[n], ndt[n], bias[n], mu[n])
  );
}

Where lambda is defined in the parameters block as real<lower=0,upper=1> lambda; and max_Y in the transformed data block as real max_Y = max(Y);.
The prior for lambda is target += normal_lpdf(lambda | 0.02, 0.01);
The full model code is posted below.

If fitting the model then I get only divergences and cannot find a way to deal with them (e.g., changing priors or control options does not appear to work).

library("rstan")
fit_mix <- stan(model_code = mixture_code, 
                data = tmp_dat, 
                chains = 4, 
                warmup = 500, iter = 1000, 
                init = initfun)
Warning messages:
1: There were 2000 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 
2: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low 
3: Examine the pairs() plot to diagnose sampling problems

Any help for getting this model to work would be very much appreciated.

The full Stan model (which needs to be in a string mixture_code for the above code to work) is below.

// adapted from brms 2.4.0 generated code
functions { 
   real wiener_diffusion_lpdf(real y, int dec, real alpha, 
                              real tau, real beta, real delta) { 
     if (dec == 1) {
       return wiener_lpdf(y | alpha, tau, beta, delta);
     } else {
       return wiener_lpdf(y | alpha, tau, 1 - beta, - delta);
     }
   }
} 
data { 
  int<lower=1> N;  // total number of observations 
  vector[N] Y;  // response variable 
  int<lower=0,upper=1> dec[N];  // decisions 
  int<lower=1> K;  // number of population-level effects 
  matrix[N, K] X;  // population-level design matrix 
  int prior_only;  // should the likelihood be ignored? 
} 
transformed data { 
  real min_Y = min(Y); 
  real max_Y = max(Y);
} 
parameters { 
  vector[K] b;  // population-level effects 
  real temp_bs_Intercept;  // temporary intercept 
  real temp_ndt_Intercept;  // temporary intercept 
  real temp_bias_Intercept;  // temporary intercept 
  real<lower=0,upper=1> lambda;
} 
transformed parameters { 
} 
model { 
  vector[N] mu = X * b;
  vector[N] bs = temp_bs_Intercept + rep_vector(0, N);
  vector[N] ndt = temp_ndt_Intercept + rep_vector(0, N);
  vector[N] bias = temp_bias_Intercept + rep_vector(0, N);
  // priors including all constants 
  target += normal_lpdf(b | 0, 2); 
  target += normal_lpdf(temp_bs_Intercept | 1, 1); 
  target += normal_lpdf(temp_ndt_Intercept | 0.2, 0.2); 
  target += normal_lpdf(temp_bias_Intercept | 0.5, 0.1); 
  target += normal_lpdf(lambda | 0.02, 0.01); 
  // likelihood including all constants 
  if (!prior_only) { 
    for (n in 1:N) {
      target += log_mix(lambda, 
          uniform_lpdf(Y[n] | min_Y, max_Y),
          wiener_diffusion_lpdf(Y[n] | dec[n], bs[n], ndt[n], bias[n], mu[n])
      );
    }
  } 
} 
generated quantities { 
  // actual population-level intercept 
  real b_bs_Intercept = temp_bs_Intercept; 
  // actual population-level intercept 
  real b_ndt_Intercept = temp_ndt_Intercept; 
  // actual population-level intercept 
  real b_bias_Intercept = temp_bias_Intercept; 
}

#2

What’s the posterior on lambda look like? Divergent transitions aren’t safe for inferrin’, but they’re good for figuring out what’s going on.

Does the model sample with a fixed lambda (pass it in as data)?


#3

I probably should have said this before, but unfortunately, it also does not converge with a fixed lambda.
More specifically, it works with fixed lambda if it is exactly 0. In all other cases, even if it is just above 0 (i.e., 0.000000001), the model appears to produce not a single non-divergent transition. Just to make this clear, it also does not work when I fix lambda to the true value of 0.02. For example I get the following output:

Warning messages:
1: There were 1995 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 
2: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low 
3: Examine the pairs() plot to diagnose sampling problems

                       mean se_mean     sd     2.5%     25%     50%     75%   97.5% n_eff   Rhat
b[1]                  -0.60    0.65   0.91    -1.41   -1.25   -0.98   -0.26    0.93     2 129.96
b[2]                  -0.38    0.69   0.98    -1.59   -1.18   -0.38    0.42    0.84     2 461.77
temp_bs_Intercept      1.36    0.10   0.15     1.14    1.29    1.37    1.51    1.52     2  13.01
temp_ndt_Intercept     0.35    0.00   0.00     0.35    0.35    0.35    0.35    0.35     4   1.32
temp_bias_Intercept    0.55    0.06   0.09     0.42    0.50    0.57    0.63    0.65     2  17.12
b_bs_Intercept         1.36    0.10   0.15     1.14    1.29    1.37    1.51    1.52     2  13.01
b_ndt_Intercept        0.35    0.00   0.00     0.35    0.35    0.35    0.35    0.35     4   1.32
b_bias_Intercept       0.55    0.06   0.09     0.42    0.50    0.57    0.63    0.65     2  17.12
lp__                -675.10  196.02 277.79 -1086.58 -857.58 -651.52 -453.40 -341.62     2  44.04

The fact that it does not even work when I fix lambda suggests to me that something seems to be really wrong about the model. But I have no idea what as the change seems not too dramatic and quite straightforward. Simply add the mixture weight to log_mix as well as the uniform distribution.

For reference, the following shows the traceplot and the density for the model in which lambda is estimated (i.e., the model from the first post). It shows that the chains appear to not move after warmup.


#4

Ooooooooooo.

uniform_lpdf(Y[n] | min_Y, max_Y)

Is all data here. Stan only computes log probabilities up to a proportionality constant. The up to a proportionality constant is determined by autodiff types in the lpdf. Since everything here is data, there is no autodiff type there, so Stan says the whole thing might as well be a constant and sets it equal to zero. (Edit: This is wrong, see Unable to recover mixture between Wiener-diffusion and uniform distribution)

So that could behave strangely. Man, now that I think about it, that whole thing doesn’t seem like it would play well with mixtures.

@bgoodri lemme get this straight. The propto behavior is cool as long as we’re multiplying conditional pdfs together (and MCMC lets us cancel constant denominators), but might it break when we do mixtures cause in that case we’re adding things?

@Henrik_Singmann Could you try the calculation with the lpdf computed manually?

So that would be -N * log(max_Y - min_Y) I think.


#5

Indeed, the uniform component has only data and no parameters. Hopefully this is the problem and there is a solution.

Unfortunately, using your suggestion of calculating the lpdf by hand does not seem to have worked, at least not the way I tried it. I have tried three different variants (note that in each case this iterates over each individual data point n):

  1. target += log_mix(lambda, 
               -log(max_Y - min_Y),
               wiener_diffusion_lpdf(Y[n] | dec[n], bs[n], ndt[n], bias[n], mu[n])
    );
    
  2. target += log_sum_exp(log(lambda) - log(max_Y-min_Y),
     log1m(lambda) + wiener_diffusion_lpdf(Y[n] | dec[n], bs[n], ndt[n], bias[n], mu[n])
    );
    
  3. target += log(lambda*(1/(max_Y-min_Y)) + 
               exp(log1m(lambda) + 
                wiener_diffusion_lpdf(Y[n] | dec[n], bs[n], ndt[n], bias[n], mu[n])));
    

With not worked I mean that I tried the three variants both with free and fixed lambda and I got the same problem (i.e., no non-divergent transitions and no movement in the chains) for essentially all samples (unless lambda = 0). The traceplots looked as the one in my previous post.


#6

The _lpdf() versions of the log prob functions compute the constants. So… these are equal:

uniform_lpdf(Y[n] | min_Y, max_Y)
-N * log(max_Y - min_Y

I don’t think that’s the root of the problem, unfortunately.


#7

@syclik Thanks for that info. I was kinda confused as to how mixtures were supposed to work.


#8

So I had a go at this.

Should the outlier generation look like this?

outliers = sample(nrow(rts), n_out)
rts[outliers,"rt"] <- runif(n_out, 0.3, 1.5)
rts[outliers, n_out),"response2"] <- sample(0:1, size = n_out, replace = TRUE)

Like, the outliers in rt and response2 happen at the same places?

I plotted the data a couple times and it looked like the outliers stick out the bottom.

data

So I just said ignore those couple datapoints with the oh so elegant:

if(Y[n] > 0.5) {
  target += log_mix(lambda, 
    uniform_lpdf(Y[n] | min_Y, max_Y),
    wiener_diffusion_lpdf(Y[n] | dec[n], bs[n], ndt[n], bias[n], mu[n])
  );
}

And it sampled and I got a happy posterior:

                      mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
b[1]                  2.16    0.01 0.14   1.90   2.07   2.14   2.24   2.44   324    1
b[2]                 -1.55    0.01 0.13  -1.82  -1.64  -1.55  -1.46  -1.29   302    1
temp_bs_Intercept     0.98    0.00 0.02   0.94   0.96   0.98   0.99   1.01   478    1
temp_ndt_Intercept    0.50    0.00 0.00   0.50   0.50   0.50   0.50   0.51   498    1
temp_bias_Intercept   0.50    0.00 0.01   0.48   0.49   0.50   0.51   0.52   338    1
lambda                0.05    0.00 0.01   0.04   0.05   0.05   0.06   0.07   374    1
b_bs_Intercept        0.98    0.00 0.02   0.94   0.96   0.98   0.99   1.01   478    1
b_ndt_Intercept       0.50    0.00 0.00   0.50   0.50   0.50   0.50   0.51   498    1
b_bias_Intercept      0.50    0.00 0.01   0.48   0.49   0.50   0.51   0.52   338    1
lp__                273.13    0.10 1.65 269.24 272.25 273.42 274.31 275.56   254    1

I’m guessing something is going on numerically in the gradients of wiener_lpdf when the values of Y are very small that’s messing up HMC? I dunno, maybe it’s clearer to you what’s going on.


#9

Not sure, if this is relevant, but if the non-decision time parameter (t0 in the above R code and tau in the Stan model) is larger than y, the wiener_lpdf will always be -Inf, regardless of what values the other parameters take. Maybe this is a problem for the calculation of gradients?


#10

@Guido_Biele Thanks for saying this. This was the one point I was not thinking about enough and which brought me to the solution. If we use the numerically robust methods to calculate the mixture likelihood via the log-likelihoods (e.g., via log_mix or log_sum_exp) the likelihood of 0 becomes -Inf and we apparently run into serious problems. However, if we think about the mixture on the likelihood scale, it is no problem as (1 - \lambda) \times 0 = 0. So this mixture component can simply be safely ignored and there is no problem. Luckily it is easy to identify these situations in the present case and we can avoid calculating this component (see below for the code).

I think this realization might also be of interest to the Stan community in general. If we build a mixture model in which one of the mixture components can produce a likelihood of 0 for certain parameters settings using the recommended functions (i.e., log_mix or log_sum_exp) can lead to a model that produce no reasonable samples. Maybe it would be a good idea to add a short paragraph on this problem and how to avoid this in the documentation. The perhaps better alternative could be if these functions could catch this automatically and simply drop a mixture component if the corresponding log-likelihood is -Inf.

In the present case I simply wrote a new likelihood function which catches when it happens and only includes the wiener process if the likelihood is non-zero.

real wiener_diffusion2_lpdf(real y, real mu, real bs, 
                            real ndt, real bias, real lambda, 
                            int dec, real min_rt, real max_rt) {
  if (y < ndt) {
    return(log(lambda) + uniform_lpdf(y | min_rt, max_rt));
  } else {
    if (dec == 1) {
      return log_mix(lambda, 
        uniform_lpdf(y | min_rt, max_rt),
        wiener_lpdf(y | bs, ndt, bias, mu)
      );
    } else {
      return log_mix(lambda, 
        uniform_lpdf(y | min_rt, max_rt),
        wiener_lpdf(y | bs, ndt, 1 - bias, - mu)
      );
    }
  }
}

The target increment then becomes:

for (n in 1:N) {
  target += wiener_diffusion2_lpdf(Y[n] | mu[n], bs[n], ndt[n], bias[n], lambda, 
                                          dec[n], min_Y, max_Y);
}

This model runs without any problems and is able to recover the parameters okayish:

4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

                      mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
b[1]                  2.13    0.00 0.14   1.85   2.04   2.13   2.22   2.41  1431    1
b[2]                 -1.51    0.00 0.14  -1.78  -1.61  -1.51  -1.42  -1.25  1459    1
lambda                0.06    0.00 0.01   0.04   0.05   0.06   0.06   0.07  1606    1
b_bs_Intercept        0.97    0.00 0.02   0.93   0.96   0.97   0.99   1.01  1840    1
b_ndt_Intercept       0.50    0.00 0.00   0.50   0.50   0.50   0.50   0.51  1861    1
b_bias_Intercept      0.50    0.00 0.01   0.48   0.49   0.50   0.51   0.52  1376    1
lp__                244.05    0.07 1.81 239.59 243.09 244.37 245.42 246.57   752    1

@bbbales2 Thanks for catching the error in the outlier generation. That was indeed what I intended to do.


#11

Nice that you found a solution!

I don’t think that the problem is the log_mix or log_sum_exp function, because they should not have a problem with zero probabilities. For example, try

library(matrixStats) 
logSumExp(c(log(.5)+log(0),log(.5)+log(.5)))

which correctly returns the same as log(exp(log(.5)+log(.5))).

I would guess that your code solves the problem, because it avoids calculating gradients when y < t_0 and wiener_lpdf returns -Inf.