Help defining non-linear model


I am trying to define a non-linear model but I am not so sure on the form it should take. I have a dataset where items can belong to classes A or B, and let’s say one predictor X which determines to which class the item belongs to. However, I additionally know that some portion P of the items have no relation to X, and belong to classes A or B by accident. At the same time, I do not know how well X predicts class assignment for the non-randomly, so I cannot just look at the accuracy of the classifier to determine P.

Now, is it possible to define this model in brms (Stan)? I tried the model described here: (the last one) which is somewhat similar to what I expect to be happening, but when I run it on simulated data where I know the proportions of randomly assigned items, the estimates do not reflect these known proportions.

The data could look something like this (in this case there we assume x1 complete:

n <- 900
k <- 100
x1 <- inv_logit(rnorm(n+k, 0, 10))
class_t <- ifelse(head(x1, n) > 0.5, "a", "b") 
class_r <- ifelse(round(runif(k, 0, 1)) == 0, "a", "b")

df_t <- tibble(class = c(class_t, class_r),
              x1 = x1)

I also tried something like this:

model <- brm(bf(class ~ round(b2) * inv_logit(b1) + round(1 - b2) * c
               , b2 ~ 1
               , c ~ 1
               , b1 ~ 1 + x1
               , nl = TRUE)
            , data = df_t
            , family = bernoulli("identity")
            , prior = c(prior(normal(-4, 3), nlpar = "b1", lb = -10, ub = 0)
                      , prior(beta(1, 1), nlpar = "b2", lb = 0, ub = 1)
                      , prior(beta(1, 1), nlpar = "c", lb = 0, ub = 1)
            , iter = 4000, warmup = 1000, chains = 4, cores = 4
            , seed = 123
            , control = list(adapt_delta = 0.9)

But that also does not seem to do what I want. Any help would be greatly appreciated.


Hey Matias, your data generators both create class labels a and b with the same probabilities (.5 for both). The signal doesn’t differ from the noise.

Where you have just two labels; you have a binary classification problem. A natural first model would be a logistic regression: y_i \sim Bernoulli(logit^{-1}(\alpha + \beta x_i)), where x_i \in \mathbb{R} is your predictor and \alpha,\beta are the coefficients you’d be fitting. There’s a good intro to it fitting logit models with Stan here. Could this help?

Hi Emir,

Thank you for your answer but I don’t see how simple logistic regression helps me here. I am trying to estimate K, which is why I’m trying to define a non linear model. But maybe what I want isn’t possible?

IMO, from your data generating function; it’s impossible to recover k. Say Bob and Alice where flipping coins. They perform a total of 1000 flips of a fair coin between them. They both flip the coin in exactly the same way. How many times did Alice flip the coin?

Well, except that for my data, each time Alice and Bob flip a coin, they make a prediction first. Alice happens to be an expert coin flipper (let’s say she’s at 95% accuracy but I don’t actually know the exact value) while Bob is at random chance. In my data I have the prediction made by Alice and Bob. I know Bob flipped k coins, so at least k/2 coins should have a wrong prediction. Is there no way to estimate k in this scenario? Or at least fit a model which allows both both data generating processes to be taken into account?

The model y_i∼Bernoulli(logit^{−1}(α+βx_i)) is incorect for a portion of the data, where the correct model should be y_i∼Bernoulli(logit^{−1}(α)).

Actually, reading a bit about it, I think a mixture model is the correct solution. If a model is the mixture of two bernoulli distributions with the formula: bf(class ~ 1, mu1 ~ x1, mu2 ~ 1). The estimated mixing proportions do seem to closely resemble the proportions of the simulated data.

Hey Matias, Bernoulli mixtures won’t be identifiable. Say you have two Bernoulli random variables distributed according to \theta_1,\theta_2 \in [0,1] and combined according to \lambda \in [0,1]. We might model the likelihood of some data p(y=1\mid\theta_1,\theta_2,\lambda)=\lambda \theta_2 + (1-\lambda)\theta_2. We can see from the form that there exists some \theta_3 \in [0,1] such that \theta_3=\lambda \theta_2 + (1-\lambda)\theta_2, and hence the mixture can be replaced with a single Bernoulli variable with parameter \theta_3. But for any \theta_3 there is an infinity ways of picking \theta_1,\theta_2,\lambda.

Hence, any generating process that is a mixture of Bernoulli’s will collapse into a single Bernoulli, and any single Bernoulli might be decomposed into a mixture an infinity ways therefore Bernoulli mixtures are not identifiable without additional information.

Why does the model calculate correct mixing proportions for simulated data? I ran 20 or so simulations with different mixture proportions and the estimated mixing proportions were always (ballpark) “correct”.

It’s identifiable because there is additional information in X. The model will stop working if beta is so small that no observed value of X makes near-certain prediction or if alpha is so large that an item can belong to class A only by accident.

1 Like

I think this solves it then. Thank you!

1 Like

I hadn’t noticed that @Matias_Guzman_Naranjo were passing x1 to the model as data. That said, I still don’t see how this model is correctly identifiable. The mixture components are B(logit^{-1}(\alpha + \beta x)) and B(logit^{-1}(\alpha)), I guess?

One expects that “a” occurs in class_r at a rate of .5, and that 95% of N=n+k will be according to x1, and hence that ~ 95% of the mixture will be weighed towards B(logit^{-1}(\alpha + \beta x)). k is then .05*N. This would be equivalently answered analytically using: k=N-sum(x1>.5).

To me it seems that k, even if x1 is data, cannot be be recovered; it’ll always just be the compliment of what didn’t happen to align with the predictor. Would appreciate if you could elaborate a bit to the contrary.

Looking at it some more, since the incidental agreement with x1 in class_r will inform the component B(logit^{-1}(\alpha+\beta x)) I’d expect the situation to get more dire when k>n. Then at some point the majority of the allocation to B(logit^{-1}(\alpha+\beta x)) will actually be as a result of noise, and k will be more wrong.

In this case we have a very large beta so predictions are basically just whether x1 exceeds some unknown threshold and k is of course always the complement of what didn’t align with the prediction.
The question is whether you can recover the threshold. The answer is that you can measure the error rate on either side and the model assumes those must be equal. That assumption can narrow down the location of the threshold enough, at least in the simulated data.

1 Like