QR decomposition and non-linear models

Dear all,

A collaborator and I are trying to fit a non-linear model to a largish data set. The model is analogous to

Y \sim \text{Binomial}(n, p) \\ \text{logit}(p) = \beta_0 + \beta_1 x_1^{c_1} + \beta_2 x_2^{c_2}

We are having trouble getting the model to fit without divergent transitions. With linear models, I have found that QR decomposition often helps - however, I am not sure it can be applied to a non-linear model such as the one above.

Any advice would be greatly appreciated.

Best wishes

Hey Robert! Welcome to the Stan forum! :)

Yes, QR decomposition can be applied in a logit model (really anywhere you have a linear predictor). You’d have \text{logit}(p) = \alpha + Q\tilde\beta and \beta=R^{-1}\tilde\beta. I guess you already checked out the QR chapter in the user guide?
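As a quick sanity check on that identity (just a numpy sketch with made-up data, not your model): with the thin decomposition X = QR, the linear predictor is the same whether you use X\beta or Q\tilde\beta with \tilde\beta = R\beta:

```python
import numpy as np

# Hypothetical design matrix: 100 observations, 2 predictors (made-up data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# Thin QR decomposition: X = Q @ R, where Q has orthonormal columns.
Q, R = np.linalg.qr(X)

# Any coefficient vector beta has an equivalent beta_tilde = R @ beta,
# leaving the linear predictor unchanged: X @ beta == Q @ beta_tilde.
beta = np.array([0.5, -1.2])
beta_tilde = R @ beta
eta_original = X @ beta
eta_qr = Q @ beta_tilde

print(np.allclose(eta_original, eta_qr))                   # same predictor
print(np.allclose(beta, np.linalg.solve(R, beta_tilde)))   # beta = R^{-1} beta_tilde
```

The point of sampling on \tilde\beta is that the columns of Q are orthogonal, which typically decorrelates the posterior.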


Hi Max,

Thanks very much for the quick answer.

The additional complexity here is that the equation for \text{logit}(p) is also non-linear. Both x_1 and x_2 have exponents, so the RHS is \alpha + \beta_1 x_1^{c_1} + \beta_2 x_2^{c_2}, and unfortunately both c_1 and c_2 have to be estimated.

My initial thought was that there was no way to apply a QR decomposition to this scenario. However, perhaps it is possible to do something like construct a matrix with columns X_j = x_j^{c_j} and then

X = QR \\ \text{logit}(p) = \alpha + Q\tilde{\beta} \\ \beta = R^{-1}\tilde{\beta}

I can see a way of doing this, but I think it would require the QR decomposition to be in the transformed parameters block - and recomputed for every draw, which would be \textbf{slow}, and I’m not even convinced that it would help that much. I’d love it if there were some neat linear algebra technique that allowed me to decompose the non-linear equation directly.
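To illustrate why it would have to live in transformed parameters (purely a numpy sketch with invented data): the Q and R factors genuinely depend on the exponents, so they cannot be precomputed once in the data block:

```python
import numpy as np

# Invented positive predictors, just for illustration.
rng = np.random.default_rng(1)
x = rng.uniform(0.5, 5.0, size=(200, 2))

def qr_of_powers(c):
    # Columns X_j = x_j^{c_j}; the QR factors change whenever c changes,
    # so they would need recomputing at every sampler iteration.
    X = x ** c
    return np.linalg.qr(X)

Q1, R1 = qr_of_powers(np.array([1.0, 1.0]))
Q2, R2 = qr_of_powers(np.array([1.3, 0.7]))

print(np.allclose(R1, R2))   # the factors differ for different exponents
```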

Thanks very much again for your help, and hopefully I’ve explained my problem more clearly this time!


p.s. thanks for the suggestion of the QR decomposition chapter in the manual! Really nice explanation, but it doesn’t go into non-linear implementations.

Hey there!

I see! Somehow I thought these were meant to be constants, like in polynomial regression… I’m just not used to people using the term “non-linear regression” correctly like you did. haha, sorry! :)

Well, I think in that case all (QR related) hope is lost… and I think it is literally out of the scope of linear algebra (although I’m not a linear algebra expert, so I might be wrong). Re-doing QR every step with parameters inside the matrix… yeah, that’s really really slow…

Do you have any other constraints here that you could work with? Like x>0 or something? Or c_1 + c_2 = 1? Maybe there is some way to take logs?


(Yeah, I mostly referred to it for the more detailed explanation, the code and the scaling thing.)

Hi Max,

Well, I think in that case all (QR related) hope is lost… and I think it is literally out of the scope of linear algebra

That was my fear - the word “linear” is a bit of a clue :-(

Logs are not a natural re-parameterization, unfortunately, because the x_1 and x_2 terms are supposed to be additive - although one discussion we could have is whether we could assume otherwise…

While there is no constraint on c_1 + c_2, both x_1 and x_2 are \ge 0.

All the best

Hi Robi,

Just a thought: if x_1 or x_2 is ever much greater than 1, then as either exponent gets even moderately large this thing’s pushforward density on the probability scale gets ridiculously concentrated near 1 or 0 (depending on the signs of the coefficients). I wonder, therefore, if this might point the way to some very strongly informative priors on the exponents (i.e. if you can rule out exponent values that place most of the prior mass for p within \epsilon of 0 or 1). Or perhaps a prior that constrains the coefficients to get very small as the exponents get large, which might allow you to express the coefficients in a way that breaks the extreme correlation between the coefficients and the exponents?
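A quick numeric illustration of how fast this collapses (made-up numbers, plain Python rather than Stan): with x = 10 and a modest negative coefficient, even small increases in the exponent push p essentially to 0:

```python
import math

def inv_logit(eta):
    # Numerically stable inverse-logit for a scalar.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

x = 10.0     # a predictor value well above 1 (illustrative)
beta = -0.5  # a modest negative coefficient (illustrative)
for c in [0.5, 1.0, 2.0, 3.0]:
    p = inv_logit(beta * x**c)
    print(c, p)   # p collapses towards 0 as c grows
```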

Is it accurate to say that the divergences are mostly at large values of the exponents and/or that there are strong posterior correlations between the coefficients and the exponents?



I could be completely wrong, but maybe something like \beta\exp(c\log x), or in Stan code beta*exp(lmultiply(c, x)) , could already help.

I suspect \beta isn’t constrained, but if it is, say \beta > 0, then you could write \beta x^c = \exp\{\log (\beta ) + c\log (x)\} and declare log_beta as a parameter.

Maybe you can even do something similar where you decompose \beta into a scaling factor and a signed weight, like \rho \cdot \exp\{\log (s) + c\log (x)\}, where s>0 and -1<\rho<1, and you declare log_s as a parameter. So s governs the scale of the effect and \rho its sign, with \beta=\rho \cdot s. In Stan code: rho * exp(log_s + lmultiply(c, x)).
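For what it’s worth, here’s a quick check (plain Python, made-up numbers) that both reparameterizations are algebraically identical to \beta x^c for x > 0. Here \rho is taken as exactly the sign of \beta; in the model it could sit anywhere in (-1, 1) with s rescaled accordingly:

```python
import math

# Made-up values for illustration; x > 0 as in the model.
x, c = 7.3, 1.4
beta = -0.8

# beta * x^c  ==  beta * exp(c * log x)   (Stan: beta * exp(lmultiply(c, x)))
direct = beta * x**c
via_log = beta * math.exp(c * math.log(x))

# Decompose beta = rho * s, with s > 0 carrying the magnitude
# and rho the sign (here rho = +/-1 for simplicity):
s = abs(beta)
rho = math.copysign(1.0, beta)
decomposed = rho * math.exp(math.log(s) + c * math.log(x))

print(direct, via_log, decomposed)   # all three agree
```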

That being said, in either case I think more structure (constraints etc.) and tight priors will probably help.


Thanks very much Jacob and Max. Sorry for the slow response - I needed to think about your nice suggestions.

Yes, x_1 and x_2 can get pretty large. Constraints do seem sensible, but the question then becomes exactly how to impose them. At some point we thought of setting c_1 = 2\cdot \text{logit}^{-1}(v) to restrict it to the (0, 2) range; I need to check whether that made it into the most current implementation. We haven’t tried a model with \beta_j and c_j correlated yet, although we have discussed it a bit. I am not sure how we would do that though - would we draw \beta_j and c_j from a multivariate normal with a strong covariance (e.g., by using an LKJ(0.5) prior) or something like that? We are assuming that the c_j are positive and relatively small (<2). If the ecological theories are correct, \beta_j will be negative (although slightly positive values are possible and have been observed in previous work).
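A small check of that transform (plain Python; v is the unconstrained parameter): c = 2\cdot\text{logit}^{-1}(v) maps the whole real line into (0, 2):

```python
import math

def inv_logit(v):
    # Numerically stable inverse-logit for a scalar.
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    z = math.exp(v)
    return z / (1.0 + z)

# Map an unconstrained v to an exponent c in (0, 2):
for v in [-5.0, 0.0, 5.0]:
    c = 2.0 * inv_logit(v)
    print(v, c)   # approaches 0, equals 1 at v = 0, approaches 2
```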

We have played around with forcing c_1 and c_2 to be correlated (drawing them from a multivariate normal). It helped, but didn’t fix the divergences.

I hadn’t thought of using the \beta \exp(c \log x) formulation - I didn’t know about the lmultiply function. That could actually be very nice and we’ll try it out. I will have a think about the decomposition you suggest, Max. It would get us closer to the z-transformed version of the densities that we got to fit in an earlier version, before someone suggested we might want the exponents.

Again, thanks very much for all this excellent advice!

Best wishes,