No adjustment after warmup period

I am experiencing a problem where parameters only vary in the warmup stage and then become constant. The problem is caused by adding Jacobian adjustment terms into the model which may complicate likelihood geometry. That is, if these terms are not in the likelihood, sampling happens as normal.

I have tried increasing adapt_delta but the problem persists.

Are there any other methods you would suggest trying out?

Update: “parameters only vary in the warmup stage” is not entirely correct – little to no variation happens in the first 10-20 iterations and THEN parameters stay mostly constant.

Haha, I see you simplified your question, but when I read the previous version of it I think my conclusion was I needed more info :D.

What is your transformation? If you’re doing a Jacobian adjustment, I assume it’s because you need to evaluate the log probability of a transformed parameter? Is this the case?

Btw you can do \LaTeX math inline by wrapping stuff up with dollar signs if that helps.

Thanks for the prompt reply!

The Jacobian adjustment happens because of change-of-variables technique: the likelihood for observed item quantity choice is formed based on the likelihood of item-level “(taste) shocks” (gumbel distributed). Item level “shocks” are a function of parameters and data – they are backed out based on (Kuhn-Tucker) optimization conditions.

Because we use “shocks” to form the likelihood for quantity choice, we need the Jacobian.

Hmm, sounds complicated, best I can do is maybe help you check your logic. I think we should definitely worry about bugs before assuming the correctly-implemented model has some impossible to sample posterior, but that’s totally possible. Lemme just walk through the transformation to make sure we’re all on the same page.

So we have things defined in the model:

x = \text{data, n of these}\\ s = \text{shocks, m of these}\\ \theta = \text{parameters of space model will be sampled on, m of these, right?}

Your likelihood is:

p(x | s) p_s(s)

Unfortunately, we can’t sample s directly but have to write it as a transformed variable, s = f^{-1}(\theta), so you need to do a transformation like in the Multivariate Change of Variables section of the manual to get things working out right?

So the thing we’ll end up sampling is:

p(x | s) p_s(f^{-1}(\theta)) | \text{det} J_{f^{-1}} (\theta)|

That all sound right?

Yes, the logic is correct except the number of (taste) parameters is not related to the number of shocks. These “shocks” are like residuals and help explain why user may make choices that are different than those predicted by their taste parameters.

There may be some bug but if I literally comment out a row, the model runs:


And when I print out the sizes of these three quantities, I get

Jacob_part: -3072.29 lpdf_part -1754.76 lcdf_part -245.459 (when Jacobian part is commented out)
Jacob_part: -3172.12 lpdf_part -5754.491 lcdf_part -104.131 (when Jacobian part is part of the code)

I have added a plot to illustrate what the problem looks like. The problem features 30 iterations but the picture is the same for 300 or more iterations – very little to no adjustment in the beginning and then constant.Rplot.pdf (5.1 KB)

So to do the Jacobian thing you need to be able to do your transformation both ways.

If s = f^-1(\theta), I don’t think they s and \theta need to live in spaces of the same dimension, but you gotta have unique solutions in both directions. So you gotta be able to solve for \theta given s and also s given \theta.

Saying that the number of parameters is not related to number of shocks makes me think that assumption broke?

The sequence that I have in mind is:
data & (taste) parameters - > implied shocks (for each quantity) – > likelihood for quantity.

There is one-to-one mapping between quantity (observed) and shocks (latent). Parameters and data affect this mapping. We use distributional assumptions of latent shocks to learn about the likelihood of observed quantity.

If it is helpful I can include a one-page snippet from the paper to illustrate the idea.

It’s often helpful to produce diagnostic output and look at the gradient/momentum values therein. In cmdstan this means setting the diagnostic_file argument.

The data here – that’s covariates, right? That’s different from “quantity” in “likelihood for quantity” right?

So do the latent variables really need to be a function of the parameters and covariates (assuming the data is covariates)?

Would it make sense to parameterize the means of the latent variables as a function of the parameters and covariates? There wouldn’t be any volume adjustment then, and I think this is the more common modeling paradigm anyway. Like this (I’ve relabeled x btw):

y = \text{data}\\ \theta = \text{parameters}\\ s = \text{shocks}\\ x = \text{covariates}

And putting normal distributions everywhere just to make it easy:

\theta \sim \text{normal}(\mu, \sigma_\theta)\\ s_i \sim \text{normal}(f_i(x, \theta), \sigma_s)\\ y_i \sim \text{normal}(s_i, \sigma_y)

In that way you sample the s latents directly, they’re a function of your parameters and covariates, and there’s no transform.

Yeah, I probably won’t understand it all, but I’ll check it out if you post it.

The data is quantity as well, so we have: y,\theta- > s (\theta,y) - > p(y|\theta)=p_s(f^{-1}(y,\theta))\cdot |det J_{f^{-1}}(y,\theta)| and we want to sample from p(y|\theta)p(\theta).

Using the hierarchy you used above:

\theta \sim \text N(\mu,\sigma_\theta)\\ s=f^{-1}(y,\theta) \sim \text{Gumbel} \\ y \sim p(y|\theta)=p_s(f^{-1}(y,\theta))\cdot |det J_{f^{-1}}(y,\theta)|

I have added a cut-out from the paper: page 1 is just for the sake of completeness. The important stuff is on page 2; Equations (4) and (5) define the mapping s=f^{-1}(y,\theta) and likelihood is defined right below it.

I am open to try and reparameterize the model for sampling but I cannot see an alternative way to do it with the current modeling assumptions which are structural in nature.

To sum up, we are not comfortable putting distributional assumptions on data (quantity) directly but rather use the distribution of latent shocks and the structural link between shocks and quantity (based on utility maximization) to derive the likelihood for data (quantity).

likelihood_formation.pdf (263.0 KB)

Niiice, good stuff. Lemme make sure I’m getting this right:

\epsilon_{scb} is a shock

p_{scb} are the parameters we’re sampling, and the eqs at the top of page 2 show us how to go from p_{scb} to \epsilon_{scb}. It looks like there’s just a scalar transformation for every p_{scb}/\epsilon_{scb} pair.

q_{scb} are quantities. Any chance p_{scb} are prices?

It looks like \epsilon_{scb} = V_{scb}.

What’s bugging me is that I expected to find a distribution on the shocks, \epsilon_{scb}, and the right hand side of the eq looks like it is something like that (if V_{scb} = \epsilon_{scb}). The left hand side, however, says this distribution is p(q_1, q_2, ..., q_K)? Is there a relation between q_{scb} and \epsilon_{scb} that I’m missing? I see how the \Psi_\epsilon term produces the e^{\epsilon_{scb}} s.

Also the Jacobian adjustments are written as if the transform we care about is from q_{scb} to \epsilon_{scb}, and, as far as I can tell, we have a transform from p_{scb} to \epsilon_{scb} (but this is probly just another variant of what I’m not understanding above).

Why doesn’t u^{T} B u fall out of the Lagrangian?

And why does \Psi_{scb} multiply the derivative of u^{T} B u in the optimality conditions? It looks like it’s just dotted with q in the original Lagrangian.

Hopefully this snail’s pace check is useful for you haha. Maybe it’ll eventually turn into insight into your problem rather than just me askin’ questions.