Getting a likelihood based on the root of an equation

Hi all,

I think this cannot be done in Stan, but I’ll be happy to be wrong, so just checking with the experts:
Say that I can find the root of an equation that depends on the parameters of my model (e.g., alpha, beta) and some independent variables (e.g., x), for example using the bisection method:

root = bisect(equation, x, alpha, beta)

Is there a way to get the likelihood of a dependent variable conditional on the root of the equation? Something like L(y | bisect(equation, x, alpha, beta))?
I would assume that the likelihood would be highest when y = root, and that it would decrease as the distance between them grows.
I know I can simulate my y pretty well, so I could use approximate Bayesian computation (ABC); I’m just checking whether there’s some way to get my likelihood function (or something that approximates it).

(If the problem is that bisection is not differentiable, could I get around this by using algebra_solver_newton or some other root finder?)

(If ABC is the only way, any recommendations for an R-based package?)

Best,
Bruno

On second thought, I guess this doesn’t make sense, because the root will be a distribution and y a point value. (So I’m even less certain that this is doable in Stan.)

I don’t understand your goal here – root finding is a deterministic operation, and it won’t have any connection to the observed data until you define an observational model.

If you’re implying no measurement noise then the observational model would be

\pi(y \mid \text{root} ) = \delta(y - \text{root})

which would definitely be way too singular for Stan’s HMC sampler.

If there is measurement noise then you’ll have to define it, for example

\pi(y \mid \text{root}) = \text{normal}(y \mid \text{root}, \sigma).

For ABC the observational model would implicitly be defined by the distance function/summary statistic you choose, which is one of the reasons why ABC is the absolute worst. Just personal opinion.

Thanks for taking a look!

But the root is not deterministic if the equation depends on a random variable, right? My problem is the observational model.

I’ll show you what I mean with a simplification of my generative model in R that can generate a set of N observations in rt.

mon_fun <- function(t, A_1=250, A_2=15, A_3=8)  {
  A_3 * t * plogis(A_2 * t * plogis(A_1 * t^2))
}
equation <- function(t, A_1, A_2, A_3) 1 - mon_fun(t, A_1, A_2, A_3)
# data
N <- 100
x <- runif(N, 0, 1)
# true values:
alpha <- 250
beta <- 30
A_2 <- 15
median_A_3 <- 8
sigma <- .5
A_1 <- alpha + beta * x
# There is at least this stochastic part,
# but maybe all the As are stochastic
A_3 <- rlnorm(N, log(median_A_3), sigma)
# container for the data:
rt <- vector("numeric", N)
for(n in 1:N){
  rt[n] <- round(pracma::bisect(equation, a = 0,
                                b=5,
                                A_1 = A_1[n],
                                A_2 = A_2,
                                A_3 = A_3[n])$root, 3)
}
head(rt)
#> [1] 0.116 0.109 0.147 0.154 0.284 0.164

Created on 2022-09-01 with reprex v2.0.2
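To be concrete about where the randomness enters: re-solving with the same A_1, A_2, and A_3 reproduces rt exactly, so given the parameter draws the root is deterministic, and all the variability in rt comes from the draws of A_3 (and x). (This is a small check I’m adding on top of the reprex; it assumes the objects above are still in scope.)

# Re-solve with the same inputs as above; bisection is deterministic,
# so rounding the new roots reproduces rt exactly.
rt_again <- sapply(seq_len(N), function(n)
  pracma::bisect(equation, a = 0, b = 5,
                 A_1 = A_1[n], A_2 = A_2, A_3 = A_3[n])$root)
all.equal(round(rt_again, 3), rt)  # should be TRUE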

Would it make sense to add an extra layer of measurement noise and do something like this?


for(n in 1:N)
    rt_[n] = algebra_solver_newton(equation,
                                   [.3]', //guess
                                   [alpha, beta, A_2, A_3]',
                                   {x[n]}, nothing);

// rt is in seconds and cannot be measured with precision smaller than milliseconds
target += normal_lpdf(rt | rt_, .001); 

where rt is a vector of N observations (reaction times in seconds, measured to millisecond precision).

Or is there another way to get \pi(y \mid \text{root})?

(I saw that there is a method called probabilistic bisection, but it looks as ugly as ABC. I’m not a big fan either, but I’d like to fit my model :) (or an approximation of it)).
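To make the idea concrete, here is roughly what that “root plus tiny measurement noise” likelihood looks like in R for a single observation. (This is a sketch I’m adding on top of the reprex above, assuming its objects are still in scope; lik_one is just an illustrative helper name.)

# Likelihood of one observed rt under "root + small normal measurement noise",
# evaluated at a few candidate values of A_2 (the true value in the reprex is 15).
lik_one <- function(A_2_candidate, n, sd_meas = 0.001) {
  root <- pracma::bisect(equation, a = 0, b = 5,
                         A_1 = A_1[n], A_2 = A_2_candidate, A_3 = A_3[n])$root
  dnorm(rt[n], mean = root, sd = sd_meas)
}
sapply(c(10, 15, 20), lik_one, n = 1)  # should peak at the true A_2 = 15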

Best,
Bruno

So the unknown parameters A_1, A_2, A_3 are sampled from a prior and then the equation

1=F\left(t,A_{1},A_{2},A_{3}\right)=A_{3}t\mathrm{plogis}\left(A_{2}t\mathrm{plogis}\left(A_{1}t^{2}\right)\right)

is solved for t and the solution is observed with negligible error.

That’s correct in theory, but super narrow distributions are likely to sample very slowly.

For a very narrow distribution you can approximately integrate out one parameter; let’s say A_3, because it’s easy to solve for A_3 from the equation given t, A_1, and A_2. The observational model is

t^{\star}\sim N\left(t,\sigma\right)

and in the limit of \sigma \to 0 this looks like

A_{3}^{\star}\sim N\left(A_{3},\left|\frac{\partial A_{3}}{\partial t}\right|\sigma\right)

where A_{3}^{\star} is the value corresponding to the observed root (keeping A_1, A_2 constant).
Now integrate and the probability is

\int\mathbb{P}\left(A_{1},A_{2},A_{3}\right)\mathrm{d}A_{3}\propto\mathbb{P}\left(A_{1},A_{2},A_{3}^{\star}\right)\times\left|\frac{\partial A_{3}}{\partial t}\right|\times\sigma

The distribution is modified by a multiplier that looks like a Jacobian adjustment.

Implicit function differentiation says that

\frac{\partial A_{3}}{\partial t}=-\left(\frac{\partial F}{\partial A_{3}}\right)^{-1}\left(\frac{\partial F}{\partial t}\right)

The derivatives are (I think)

\frac{\partial F}{\partial A_{3}}=\frac{F}{A_{3}}
\frac{\partial F}{\partial t}=\frac{F}{t}+A_{2}t\mathrm{plogis}^{\prime}\left(A_{2}t\mathrm{plogis}\left(A_{1}t^{2}\right)\right)\left(\mathrm{plogis}\left(A_{1}t^{2}\right)+2A_{1}t^{2}\mathrm{plogis}^{\prime}\left(A_{1}t^{2}\right)\right)

and of course at the solution F=1.
So, something like this

data {
  int N;
  vector[N] rt;
  real median_A_3;
  real<lower=0> sigma;
  real alpha;
  real<lower=0> beta;
}
parameters {
  real<lower=0> A_2;
  vector<lower=0,upper=1>[N] x;
}
transformed parameters {
  vector[N] A_1 = alpha + x*beta;
  vector[N] A_3;
  for (i in 1:N) {
    A_3[i] = 1.0/(rt[i]*plogis(A_2*rt[i]*plogis(A_1[i]*rt[i]*rt[i])));
  }
}
model {
  x ~ uniform(0,1);
  //A_2 ~ prior() (if it's a parameter)
  A_3 ~ lognormal(log(median_A_3), sigma);  // median on the log scale, as in the R code

  // integrated likelihood, up to a constant
  for (i in 1:N) {
    real rt_ = rt[i];
    real A1rt2 = A_1[i]*rt_*rt_;
    real pA1rt2 = plogis(A1rt2);
    real d = A_2*rt_*plogis_der(pA1rt2)*(pA1rt2 + 2*A1rt2*plogis_der(A1rt2));
    target += log(A_3[i]) + log(1/rt_ + d);
  }
}
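(As a quick sanity check of the inversion in the transformed parameters block, A_3 can be recovered from each simulated root in R; it should reproduce the generating A_3 up to the rounding of rt. This is an extra snippet, not part of the model, and it assumes the objects from the reprex above are in scope.)

# Invert the equation at the observed root:
# A_3 = 1 / (t * plogis(A_2 * t * plogis(A_1 * t^2))).
A_3_recovered <- 1 / (rt * plogis(A_2 * rt * plogis(A_1 * rt^2)))
summary(abs(A_3_recovered - A_3))  # differences come only from rounding rt to 3 digits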

Wow, if I can make this work… This is quite amazing! Thanks!!!

I have many questions about the math. Also, I couldn’t get the Stan model based on this to recover a parameter of the model, even when assigning super strong priors to all the rest of the parameters (see the end of this post). If this takes too much of your time, please DM me and we can discuss it. I’ll be happy to make you a co-author of whatever I end up writing using this model! I’m mind-blown that this can even be done!

Yes, but also {A_3}_i \sim \text{LogNormal}(\mu, \sigma) for each observation i; I’m not sure if that matters. I think the important thing is that if I could solve the equation for t analytically (which I can’t), I would have the distribution of t.

Notice that the observational model t^{\star}\sim N\left(t,\sigma\right) with \sigma = .001 was just my hack in order to fit t^\star; I’m not sure it’s very meaningful.

In general, I don’t really understand why my problem can be solved by integrating out an arbitrary parameter of my model… (feel free to tell me what to read if this question is trivial).

Here are my detailed questions:

I’m completely lost here on what A_{3}^{\star} is and how you derive its distribution.

Sorry, again here, why are they proportional?

Sorry, again, why? Where is the minus coming from? I don’t get which equation you’re differentiating…

Now a question about the Stan code: you add the log of \partial A_3 / \partial t to the target. Shouldn’t we also add the log of \sigma? But since \sigma \to 0, wouldn’t that mean adding log(0)?? Or am I wrong here?

Ok, even though I don’t understand why this should work, I rewrote the model, given that only x and rt are data and all the other quantities are parameters. Also, plogis in Stan is inv_logit, and its derivative is inv_logit(x)*(1-inv_logit(x)). I put extremely strong priors on all the parameters except A_2, to see if I could recover it, but I couldn’t. I’m not sure whether there is a trivial error in the math or the code, but I’ll need to understand more of what I’m doing to figure that out :/

data {
  int N;
  vector[N] rt;
  vector<lower=0,upper=1>[N] x;
}
parameters {
  real lalpha;
  real lbeta;
  real<lower = 0> A_2;
  real lmedian_A_3;
  real<lower = 0> sigma;
}
transformed parameters {
  vector[N] A_1 = exp(lalpha) + x*exp(lbeta);
  vector[N] A_3;
  for (i in 1:N) {
    A_3[i] = 1.0/(rt[i]*inv_logit(A_2*rt[i]*inv_logit(A_1[i]*rt[i]*rt[i])));
  }
}
model {
   target += normal_lpdf(lalpha | log(250),.0001);
   target += normal_lpdf(lbeta | log(30), .0001);
   target += normal_lpdf(A_2 | 15, 5);
   target += lognormal_lpdf(A_3 | lmedian_A_3, sigma);
   target += normal_lpdf(lmedian_A_3 | log(8), .0001);
   target += normal_lpdf(sigma | .5, .0005);
  // integrated likelihood, up to a constant
  for (i in 1:N) {
    real rt_ = rt[i];
    real A1rt2 = A_1[i]*rt_*rt_;
    real pA1rt2 = inv_logit(A1rt2);
    real ppA1rt2 = inv_logit(pA1rt2);
    real d = A_2*rt_*(ppA1rt2*(1- ppA1rt2))*(pA1rt2 + 2*A1rt2*(pA1rt2 * (1-pA1rt2)));
    target += log(A_3[i]) + log(1/rt_ + d);
  }
}
generated quantities {
  real alpha = exp(lalpha);
  real beta = exp(lbeta);
  real median_A_3 = exp(lmedian_A_3);
}

Best,
Bruno

Sorry, framing it as integrating out a parameter was confusing.

You said you can solve the problem with ABC. How does that go? I assume it’s

  1. Draw A_1,A_2,A_3 from the prior
  2. Solve for t
  3. Accept the draw if \left|t-t^{\star}\right|<\epsilon, otherwise try again

Stan can sample from the same distribution like so:

  1. The prior is the prior for parameters A_1,A_2
  2. The likelihood is the probability of generating an accepted draw, conditional on A_1,A_2.

At step (2), the draw is accepted if A_3 is such that solving the equation yields a root in a narrow interval around the observed t^\star. This in turn means A_3 falls in a narrow interval around A_3^\star (the value of A_3 that implies t=t^\star). The equation is

1=A_{3}tf\left(t,A_{1},A_{2}\right)

(where the complicated parts are hidden inside the function f) and while solving for t requires iterative methods, solving for A_3 when t is known is straightforward

A_{3}^\star=\frac{1}{t^\star f\left(t^\star,A_{1},A_{2}\right)}

(I don’t know why I brought up implicit differentiation previously; you can get \frac{\partial A_3}{\partial t} from the above directly.)
Since we’re dealing with a super narrow interval we can approximate

\left|t-t^{\star}\right|<\epsilon\Leftrightarrow\left|A_{3}-A_{3}^{\star}\right|<\left|\frac{\partial A_{3}}{\partial t}\right|\epsilon

and the probability of hitting that interval

\mathbb{P}\left(\left|A_{3}-A_{3}^{\star}\right|<\left|\frac{\partial A_{3}}{\partial t}\right|\epsilon\right)\approx 2\epsilon \left|\frac{\partial A_{3}}{\partial t}\right| p\left(A_{3}^{\star}\right)

where p\left( A_3\right) is the probability density function for the (prior) distribution of A_3 and \epsilon is a constant (so it doesn’t matter for the posterior distribution).

The above really just amounts to computing the distribution of t: you don’t need an analytic solution for the equation; the density function of t is the density function of A_3 multiplied by the Jacobian \left|\frac{\partial A_{3}}{\partial t}\right|.
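Here is what that rejection scheme looks like in R for a single observation (an illustrative sketch I’m adding; it assumes the objects from the reprex earlier in the thread are in scope, and eps and the number of prior draws are arbitrary choices):

# ABC rejection for observation 1: draw A_3 from its prior, solve for the root,
# and accept when the root lands within eps of the observed rt[1].
set.seed(1)
eps <- 1e-3
accepted <- replicate(5000, {
  A_3_draw <- rlnorm(1, log(median_A_3), sigma)
  root <- pracma::bisect(equation, a = 0, b = 5,
                         A_1 = A_1[1], A_2 = A_2, A_3 = A_3_draw)$root
  if (abs(root - rt[1]) < eps) A_3_draw else NA_real_
})
mean(!is.na(accepted))  # acceptance rate, roughly 2 * eps * |dA_3/dt| * p(A_3_star)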

I think I did the derivative wrong; terms always seem to go missing when I apply the chain rule more than once…

Here’s a second attempt (using the identity 1-inv_logit(x)==inv_logit(-x))

for (i in 1:N) {
  real rt_ = rt[i];
  real A1rt2 = A_1[i]*rt_*rt_;
  real invlgA1 = inv_logit(A1rt2);
  real d = inv_logit(-A_2*rt_*invlgA1)*invlgA1*(1+2*A1rt2*inv_logit(-A1rt2));
  target += log(A_3[i]) + log(1/rt_ + A_2*d);
}
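A quick numerical check of that Jacobian term, in R rather than Stan (an extra snippet reusing the reprex objects; A3_of_t is an ad hoc helper name):

# Compare the analytic |dA_3/dt| used in the target increment with a central
# finite difference of the inversion A_3(t) = 1 / (t * plogis(A_2*t*plogis(A_1*t^2))).
A3_of_t <- function(t, A_1, A_2) 1 / (t * plogis(A_2 * t * plogis(A_1 * t^2)))
t0 <- rt[1]; a1 <- A_1[1]
A1rt2   <- a1 * t0^2
invlgA1 <- plogis(A1rt2)
d <- plogis(-A_2 * t0 * invlgA1) * invlgA1 * (1 + 2 * A1rt2 * plogis(-A1rt2))
analytic  <- A3_of_t(t0, a1, A_2) * (1 / t0 + A_2 * d)
numerical <- abs((A3_of_t(t0 + 1e-6, a1, A_2) - A3_of_t(t0 - 1e-6, a1, A_2)) / 2e-6)
c(analytic = analytic, numerical = numerical)  # should agree to several digits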

🤯 It works!

Thanks for the super clear explanation!

The only thing I didn’t get was the approximation of the probability of hitting the interval: why is it true?

Is there a name for this way of getting a likelihood?

Best,
Bruno

The probability is the integral of the density function; since the interval is so short the density is approximately constant and the result is the density (at the middle) multiplied by the length of the interval.
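Numerically, with the lognormal prior from the example (an illustrative check I’m adding; A_3_star and delta are arbitrary):

# P(|A_3 - A_3_star| < delta) versus the small-interval approximation 2*delta*p(A_3_star),
# for A_3 ~ lognormal(log(8), 0.5) as in the reprex.
A_3_star <- 10; delta <- 1e-3
exact_p  <- plnorm(A_3_star + delta, log(8), 0.5) - plnorm(A_3_star - delta, log(8), 0.5)
approx_p <- 2 * delta * dlnorm(A_3_star, log(8), 0.5)
c(exact = exact_p, approx = approx_p)  # nearly identical for small delta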


By getting a likelihood “this way”, I mean this:

Uhh, Bayesian inference?
Likelihood is always the “probability of the observed data conditional on the unknown parameters.”
In this case the “observed data” is truncated to only say whether it’s inside or outside the interval and—surprise, surprise—turns out it’s inside. This is the “approximation” in “Approximate Bayesian Computation.”


Hi, it’s me again…
I was exploring other models, which didn’t work at all, and I decided to test this again.

My issue is quite dumb, but I really don’t get how @nhuurre calculated the derivative. I tried doing this:

\frac{dA_{3}}{dt} = \frac{d}{dt}\left(\frac{1}{t \cdot \mathrm{inv\_logit}\left(A_2 \, t \cdot \mathrm{inv\_logit}\left(A_1 t^{2}\right)\right)}\right)

(both with Wolfram Alpha and SymPy, to be honest), and I got something quite different, which didn’t work at all in the Stan model. Before taking the log, it would be:

(-1*(A_2*rt_*(2*A_1[i]*rt_^2*inv_logit(-A_1[i]*rt_^2) + inv_logit(A_1[i]*rt_^2))*inv_logit(-A_2*rt_*inv_logit(A_1[i]*rt_^2)) + inv_logit(A_2*rt_*inv_logit(A_1[i]*rt_^2)))/(rt_^2*inv_logit(A_2*rt_*inv_logit(A_1[i]*rt_^2))^2)))

So I would need to replace the target line with:

target += log(A_3[i])+ log (abs(-1*(A_2*rt_*(2*A_1[i]*rt_^2*inv_logit(-A_1[i]*rt_^2) + inv_logit(A_1[i]*rt_^2))*inv_logit(-A_2*rt_*inv_logit(A_1[i]*rt_^2)) + inv_logit(A_2*rt_*inv_logit(A_1[i]*rt_^2)))/(rt_^2*inv_logit(A_2*rt_*inv_logit(A_1[i]*rt_^2))^2)));

but as I said this didn’t work.

(I tried plugging values into @nhuurre’s original derivative and into mine, and they are indeed different.)

This is what I did in Wolfram Alpha, where f(x) and f'(x) stand for inv_logit(x) and inv_logit(-x) respectively in Stan. SymPy in Python gave me exactly the same, so I must be doing something very silly.

I’m not entirely sure but I think you are incorrectly substituting inv_logit(-x) as the derivative of inv_logit(x) when the correct derivative is inv_logit(x)*inv_logit(-x).

Here’s how I calculated the derivative:
I mentioned earlier that the equation is

A_{3}=\frac{1}{tf\left(t,A_{1},A_{2}\right)}
f\left(t,A_{1},A_{2}\right)=h\left(A_{2}th\left(A_{1}t^{2}\right)\right)

where h is inv_logit.
Now, derivative of A_3

\frac{\partial}{\partial t}A_{3}=-\frac{f\left(t,A_{1},A_{2}\right)+t\frac{\partial}{\partial t}f\left(t,A_{1},A_{2}\right)}{t^{2}f\left(t,A_{1},A_{2}\right)^{2}}

and substituting tA_{3}f\left(\dots\right)=1 simplifies this to

=-\frac{\frac{1}{tA_{3}}+t\frac{\partial}{\partial t}f\left(t,A_{1},A_{2}\right)}{A_{3}^{-2}}=-\left(\frac{A_{3}}{t}+\frac{A_{3}}{f}\frac{\partial}{\partial t}f\left(t,A_{1},A_{2}\right)\right)

(only the absolute value of this enters the target increment).

Then the part with f is chain rule and product rule

\frac{\partial}{\partial t}f\left(t,A_{1},A_{2}\right)=h^{\prime}\left(A_{2}th\left(A_{1}t^{2}\right)\right)\frac{\partial}{\partial t}\left(A_{2}th\left(A_{1}t^{2}\right)\right)
=h^{\prime}\left(A_{2}th\left(A_{1}t^{2}\right)\right)\left(A_{2}h\left(A_{1}t^{2}\right)+A_{2}t\frac{\partial}{\partial t}h\left(A_{1}t^{2}\right)\right)
=h^{\prime}\left(A_{2}th\left(A_{1}t^{2}\right)\right)\left(A_{2}h\left(A_{1}t^{2}\right)+A_{2}th^{\prime}\left(A_{1}t^{2}\right)\times\frac{\partial}{\partial t}\left(A_{1}t^{2}\right)\right)
=h^{\prime}\left(A_{2}th\left(A_{1}t^{2}\right)\right)\left(A_{2}h\left(A_{1}t^{2}\right)+A_{2}th^{\prime}\left(A_{1}t^{2}\right)\times2A_{1}t\right)

Substitute h^{\prime}\left(x\right)=h\left(x\right)h\left(-x\right) and collect terms

=h\left(A_{2}th\left(A_{1}t^{2}\right)\right)h\left(-A_{2}th\left(A_{1}t^{2}\right)\right)A_{2}h\left(A_{1}t^{2}\right)\left(1+th\left(-A_{1}t^{2}\right)\times2A_{1}t\right)

and that should be what I had before.
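If it helps, the collected expression can be checked numerically in R with arbitrary values (a small extra snippet; f_fun is just an ad hoc name for f):

# Check df/dt (the collected expression above) against a central finite difference.
f_fun <- function(t, A_1, A_2) plogis(A_2 * t * plogis(A_1 * t^2))
t0 <- 0.15; a1 <- 260; a2 <- 15
h1 <- plogis(a1 * t0^2)
analytic  <- plogis(a2 * t0 * h1) * plogis(-a2 * t0 * h1) * a2 * h1 *
             (1 + 2 * a1 * t0^2 * plogis(-a1 * t0^2))
numerical <- (f_fun(t0 + 1e-6, a1, a2) - f_fun(t0 - 1e-6, a1, a2)) / 2e-6
c(analytic = analytic, numerical = numerical)  # should match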


This is one of the reasons why the term “random variable” is next to useless.

A probabilistic model in Stan is specified through a joint density function \pi(y, \theta) which is then partially evaluated on the data to give

\pi(\theta \mid \tilde{y}) \propto \pi(\tilde{y}, \theta).

Behind the scenes all of Stan’s algorithms query a model by evaluating this density function at various model configurations. That means that if you have some implicit system as part of your model it can always be solved deterministically, based on the current value of the model configuration \theta. Nothing is random about the evaluation of the model once the model configuration has been set!

The real problem here is that by assuming perfect measurements – the outputs of the implicit system are observed with infinite precision – the observational model becomes singular. Some sliver of model configurations will be consistent with any given observation, but most of the model configurations will be completely inconsistent, with likelihoods/posterior densities of zero. In other words the posterior density function will not be smooth; the gradients everywhere but at the sliver will be zero, providing no information for finding the sliver, and the gradients at the sliver in transverse directions will be infinite, which breaks Hamiltonian Monte Carlo.

Let’s make this a bit more formal. We have some family of deterministic mappings that takes an input x to some output z, z = f(x; \theta). In your case this function is defined implicitly but it doesn’t matter so long as we can always solve for z given x and \theta. Because the measurements are assumed to be perfect the observational model becomes a Dirac delta function,

\pi(y \mid \theta) = \delta( y - f(x; \theta) )

which is zero everywhere that y \ne f(x; \theta).

What we care about for inference are the model configurations \theta consistent with an observation \tilde{y}. If f is sufficiently well-behaved then we can essentially rewrite this observational model as

\pi(y \mid \theta) = \delta( \theta - f^{-1}(y; x) ) \, \left| \frac{ \partial f}{ \partial \theta }( \theta, x) \right|^{-1}.
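For a one-dimensional \theta this is just the delta function composition rule applied to g(\theta) = y - f(x; \theta), which by assumption has the single root \theta = f^{-1}(y; x) (the symbol g is introduced only for this step):

\delta\left( g(\theta) \right) = \sum_{i} \frac{ \delta( \theta - \theta_{i} ) }{ \left| g^{\prime}(\theta_{i}) \right| }, \qquad g(\theta_{i}) = 0,

and here g^{\prime}(\theta) = -\frac{ \partial f}{ \partial \theta }( \theta, x), which produces the Jacobian factor above.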

Notice the appearance of the gradient that also showed up in @nhuurre’s method. One way of interpreting that method is that it directly parameterizes the set of model configurations where \theta = f^{-1}(\tilde{y}; x) instead of hoping that Hamiltonian Monte Carlo will find them. When this is feasible it is really the only productive approach.

What about adding some small measurement noise? Replacing the Dirac delta function with a narrow normal density function smooths out the sliver, regularizing all of the gradients so that an algorithm like Hamiltonian Monte Carlo stands a chance. That said, even if Hamiltonian Monte Carlo works it will require small step sizes and long numerical trajectories resulting in costly transitions. Working out \theta = f^{-1}(\tilde{y}; x) is far less expensive, but again is not often feasible.

Lastly, what about ABC? Well, ABC implicitly defines a pseudo-likelihood function based on some distance in observation space and a fixed tolerance. In this case that tolerance has a very similar impact to adding some small measurement noise, only with consequences that are harder to quantify because everything is so implicit. This is one of the fundamental problems with ABC – at best it implements inference for an approximate model, and it’s very difficult to quantify how close that approximate model is to the exact model.


@nhuurre, thank you! I had two mistakes, in fact; everything is clear now. My other models are working as well now.

@betanalpha, thanks for the answer.
Maybe a silly question, but the ^{-1} is for the function, right (you’re not suggesting 1/Jacobian)? But then the equation should be like this, right?

\pi(y \mid \theta) = \delta( \theta - f^{-1}(y; x) ) \, \left| \frac{ \partial f^{-1}}{ \partial \theta }( \theta, x) \right|.

If I understand you correctly, @nhuurre’s method is doing basically that? Or am I wrong?
But I could also add measurement error on top of that, right?

The Jacobian of the inverse function equals the inverse of the Jacobian; see e.g. the inverse function rule

\left|\frac{\partial f^{-1}}{\partial y}(y,x)\right|=\left|\frac{\partial f}{\partial\theta}(\theta,x)\right|^{-1}
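(In one dimension this is just differentiating the identity f\left( f^{-1}(y; x); x \right) = y with respect to y, which gives

\frac{ \partial f}{ \partial \theta }\left( f^{-1}(y; x), x \right) \, \frac{ \partial f^{-1}}{ \partial y }(y, x) = 1.)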

That’s correct.


Yes, replacing the delta function with a less singular probability density function would account for additional variation in the measurement process. Smoothing out that singularity would also make the posterior density function a bit nicer.