Dear all,

A collaborator and I are trying to fit a non-linear model to a largish data set. The model is analogous to

Y \sim \mathrm{Binomial}(n, p) \\
\mathrm{logit}(p) = \beta_0 + \beta_1 x_1^{c_1} + \beta_2 x_2^{c_2}

We are having issues getting the model to converge without divergences. With linear models, I have found that QR decomposition often helps; however, I am not sure it can be applied to a non-linear model such as the one above.

Any advice would be greatly appreciated.

Best wishes

Robi

Hey Robert! Welcome to the Stan forum! :)

Yes, QR decomposition can be applied in a logit model (really anywhere you have a linear predictor). You'd have \text{logit}(p) = \alpha + Q\tilde\beta and \beta = R^{-1}\tilde\beta. I guess you already checked out the QR chapter in the user guide?
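In case it helps, here's a quick numpy sketch of that algebra (toy data, nothing from your model): the linear predictor is identical whether you work with X\beta or with Q\tilde\beta, and \beta is recovered as R^{-1}\tilde\beta.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy design matrix and "true" coefficients (illustrative values only).
X = rng.normal(size=(100, 2))
beta = np.array([0.5, -1.2])

# Thin QR decomposition: X = Q R, where Q has orthonormal columns.
Q, R = np.linalg.qr(X)

# Coefficients on the Q scale; the linear predictor is unchanged.
beta_tilde = R @ beta
eta_x = X @ beta        # X * beta
eta_q = Q @ beta_tilde  # Q * beta_tilde

# Recover beta from beta_tilde: beta = R^{-1} * beta_tilde.
beta_recovered = np.linalg.solve(R, beta_tilde)

assert np.allclose(eta_x, eta_q)
assert np.allclose(beta, beta_recovered)
```

In Stan you'd sample \tilde\beta and compute \beta in generated quantities, so the sampler sees the better-conditioned Q-scale geometry.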

Cheers,

Max

Hi Max,

Thanks very much for the quick answer.

The additional complexity here is that the equation for \mathrm{logit}(p) is also non-linear. Both x_1 and x_2 have exponents, so the RHS is \alpha + \beta_1 x_1^{c_1} + \beta_2 x_2^{c_2}, and unfortunately both c_1 and c_2 have to be estimated.

My initial thought was that there was no way to apply a QR decomposition to this scenario. However, perhaps it is possible to do something like making the columns of the matrix X_j = x_j^{c_j} and then

X = QR \\
\mathrm{logit}(p) = \alpha + Q\tilde{\beta} \\
\beta = R^{-1}\tilde{\beta}

I can see a way of doing this, but I think it would require the QR decomposition to be in the transformed parameters block - and re-estimated for each sample, which would be \bf{slow} - and I'm not even convinced that it would help that much. I'd love it if there were some neat linear algebra technique that allowed me to decompose the non-linear equation directly.

Thanks very much again for your help, and hopefully I've explained my problem more clearly this time!

Robi

p.s. thanks for the suggestion of the QR decomposition chapter in the manual! Really nice explanation, but it doesn't go into non-linear implementations.

Hey there!

I see! Somehow I thought these were meant to be constants, like in polynomial regression… I'm somehow not used to people using the term "non-linear regression" correctly like you did. Haha, sorry! :)

Well, I think in that case all (QR-related) hope is lost… and I think it is literally out of the scope of *linear* algebra (although I'm not a linear algebra expert, so I might be wrong). Re-doing QR every step with parameters inside the matrix… yeah, that's really, really slow…

Do you have any other constraints here that you could work with? Like x>0 or something? Or c_1 + c_2 = 1? Maybe there is some way to take logs?

Cheers,

Max

(Yeah, I mostly referred to it for the more detailed explanation, the code and the scaling thing.)

Hi Max,

Well, I think in that case all (QR-related) hope is lost… and I think it is literally out of the scope of *linear* algebra

That was my fear - the word "linear" is a bit of a clue :-(

Logs are not a natural re-parameterization, unfortunately, because x_1 and x_2 are supposedly additive - although a discussion I can have is whether we could assume otherwise…

While there is no constraint on c_1 + c_2, both x_1 and x_2 are \ge 0.

All the best

Robi

Hi Robi,

Just a thought: if x_1 or x_2 is ever much greater than 1, then as either exponent gets even moderately large, this thing's pushforward density on the probability scale gets ridiculously concentrated near 1 or 0 (depending on the signs of the coefficients). I wonder, therefore, if this might point the way to some very strongly informative priors on the exponents (i.e. if you can rule out the parameter values that place most of the prior mass for p within \epsilon of 0 or 1). Or perhaps a prior that constrains the coefficients to get very small as the exponents get large, which might allow you to express the coefficients in a way that breaks the extreme correlations between the coefficients and the exponents?
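To make that concrete, here's a toy calculation (made-up numbers, nothing from your actual data) showing how fast p gets pinned against the boundary once x is well above 1 and the exponent grows:

```python
import numpy as np

def inv_logit(eta):
    """Standard logistic function: maps the linear predictor to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

# Illustrative values: x = 10 and a modest negative coefficient.
x = 10.0
beta = -0.5

# As c grows, eta = beta * x^c explodes and p collapses toward 0.
for c in [0.5, 1.0, 2.0, 2.5]:
    eta = beta * x**c
    print(f"c = {c}: eta = {eta:9.2f}, p = {inv_logit(eta):.3e}")
```

Already at c = 2 the probability is astronomically small, which is why a prior that keeps the exponents (or the product \beta x^c) moderate seems worth exploring.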

Is it accurate to say that the divergences are mostly at large values of the exponents and/or that there are strong posterior correlations between the coefficients and the exponents?

Cheers

Jacob


I could be completely wrong, but maybe something like \beta\exp(c\log x), or in Stan code `beta * exp(lmultiply(c, x))`, could already help.

I bet \beta isn't constrained, but if it is, like \beta > 0, then \exp\{\log(\beta) + c\log(x)\} and you could declare `log_beta` as a parameter.

Maybe you can even do something similar, where you decompose \beta into a scaling factor and a correlation, like \rho \cdot \exp\{\log(s) + c\log(x)\}, where s > 0 and -1 < \rho < 1, and you declare `log_s` as a parameter. So s governs the scaling of the effect and \rho the sign, with \beta = \rho \cdot s. In Stan code: `rho * exp(log_s + lmultiply(c, x))`.
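A quick numpy check of those identities, with made-up values (Stan's `lmultiply(c, x)` is just `c * log(x)`, so that's what I use here):

```python
import numpy as np

# Hypothetical values for illustration; the log form needs x > 0.
x = np.array([0.5, 2.0, 10.0])
c = 1.3
beta = -0.8

# beta * exp(lmultiply(c, x)) is just beta * x^c on the log scale.
direct = beta * x**c
log_scale = beta * np.exp(c * np.log(x))

# The decomposition: beta = rho * s, with s > 0 carrying the scale
# (declared via log_s) and rho in (-1, 1) carrying the sign.
rho = -0.8     # sign part
log_s = 0.0    # so s = 1 and beta = rho * s = -0.8
decomposed = rho * np.exp(log_s + c * np.log(x))

assert np.allclose(direct, log_scale)
assert np.allclose(direct, decomposed)
```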

That being said, in either case I think more structure (constraints etc.) and tight priors will probably help.

Cheers,

Max

Thanks very much Jacob and Max. Sorry for the slow response - I needed to think about your nice suggestions.

Yes, x_1 and x_2 can get pretty large. Constraints do seem sensible, but the question then becomes exactly *how* to impose them. We have thought of setting c_1 = 2\cdot\mathrm{logit}^{-1}(v) at some point to restrict it to the 0-2 range. I need to check if that is in the most current implementation. We haven't tried a model with b_j and c_j correlated yet, although we have discussed it a bit. I am not sure how we might do that though - would we draw b_j and c_j from a multivariate normal with a strong covariance (e.g., by using an LKJ(0.5) prior) or something like that? We are assuming that the c_j are positive and relatively small (<2). If the ecological theories are correct, \beta_j will be negative (but slightly positive values are possible and have been observed in previous work).
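For what it's worth, here is a toy numpy check of that mapping (v is the unconstrained parameter; the values are arbitrary):

```python
import numpy as np

def inv_logit(v):
    """Standard logistic function, i.e. logit^{-1}."""
    return 1.0 / (1.0 + np.exp(-v))

# c = 2 * inv_logit(v) maps an unconstrained v smoothly into (0, 2).
# (In Stan, a <lower=0, upper=2> declaration does the same transform.)
v = np.array([-10.0, 0.0, 10.0])
c = 2.0 * inv_logit(v)

assert np.all((c > 0) & (c < 2))
assert np.isclose(c[1], 1.0)  # v = 0 maps to the midpoint c = 1
```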

We have played around with forcing c_1 and c_2 to be correlated (drawing them from a multivariate normal). It helped, but didn't fix the divergences.

I hadn't thought of using the \beta \exp(c \log x) formulation - I didn't know about the `lmultiply` function. That could actually be very nice, and we'll try it out. I will have a think about the decomposition you suggest, Max. It would get us closer to the z-transformed version of the densities that we got to fit in an earlier version, before someone suggested we might want the exponents.

Again, thanks very much for all this excellent advice!

Best wishes,

Robi
