Log target vs. non-linear model

Key question: should I use log(y) ~ …, or y ~ exp(…), or y ~ … with a lognormal family… or something else?

I’ve inherited a model that looks like this:

library(brms)

m1 <- brm(
  bf(y / off_set ~ exp(b0 + b1 * f1 + b2 * f2),  # non-linear mean on the ratio scale
     b0 ~ 1,
     b1 ~ 1,
     b2 ~ 1,
     nl = TRUE),
  prior = c(
    prior(normal(0, 5), nlpar = "b0"),
    prior(normal(0, 5), nlpar = "b1"),
    prior(normal(0, 5), nlpar = "b2")
  ),
  data = df,
  family = gaussian()   # Gaussian likelihood on y / off_set
)

It gives reasonable predictions but, given that y > 0 and off_set > 0, the pp_check doesn’t look ideal to me:

I"d thought a better and simpler way to specify the model would be

m2 <- brm(log(y / off_set) ~ f1 + f2,
          data = df)   # gaussian family by default

Which gives a nice-looking pp_check plot:

But on a crude measure of fit, the first seems to perform better – that is, comparing y to pred * off_set for the first model vs. exp(pred) * off_set for the second…
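For concreteness, the kind of crude comparison I mean is roughly RMSE on the y scale (using RMSE here just as a stand-in), something like:

pred1 <- colMeans(posterior_epred(m1))        # m1 predicts y / off_set directly
pred2 <- colMeans(posterior_epred(m2))        # m2 predicts log(y / off_set)

rmse <- function(obs, est) sqrt(mean((obs - est)^2))
rmse(df$y, pred1 * df$off_set)                # back to the y scale for m1
rmse(df$y, exp(pred2) * df$off_set)           # exp() then rescale for m2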

Any help / guidance much appreciated!

I would question that measure of fit, then. The PPCs for the second model look much better – maybe it’s something to do with the scales here? One is on the log scale and the other is not.

Going by that second plot too, just go ahead and use a lognormal family, since your data are constrained to be positive. Use whichever scale makes more sense for your problem: if it’s the log scale you’re concerned with, do log(y) ~ normal(...); if the application makes more sense on the original scale, do y ~ lognormal(...).
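Concretely, a minimal sketch of those two options with your variable names (the model names here are just placeholders):

library(brms)

# (a) normal model on the log scale (basically your m2):
m_log <- brm(log(y / off_set) ~ f1 + f2, data = df, family = gaussian())

# (b) lognormal model on the original scale; mu uses an identity link and
#     lives on the log scale, so (a) and (b) are essentially reparameterisations:
m_ln <- brm(y / off_set ~ f1 + f2, data = df, family = lognormal())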

Super, thanks for your reply, appreciate it!

One follow up question…

The distribution I’m most interested in is the sum of the posterior predictive draws. I.e. once I’ve got e.g. 500 samples for each of N data points, I want to sum over N and take summary stats from that distribution.
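Something along these lines, where fit stands in for whichever brms model I end up using:

yrep  <- posterior_predict(fit)           # draws x N matrix of predictions
total <- rowSums(yrep)                    # one summed value per posterior draw
quantile(total, c(0.05, 0.5, 0.95))       # summary stats of the sum's distribution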

The problem is that with x ~ lognormal(...) I get very large values – e.g. a median ~6x larger than the actual sum of x in the training data.

Is the solution to truncate the distribution? If so, how do I get samples from a truncated lognormal with the generated quantities block below?

Thanks in advance

generated quantities {
  array[N] real y_pred;   // posterior predictive draws, one per observation

  for (n in 1:N) {
    y_pred[n] = lognormal_rng(mu[n], sigma);
  }
}
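In R terms, the kind of thing I’m imagining is below (upper being some cap I’d have to choose) – I guess the same inverse-CDF trick could be written in generated quantities, but I’m not sure truncation is even the right fix:

# Draw from a lognormal truncated above at `upper`, via the inverse CDF
rlnorm_trunc <- function(n, meanlog, sdlog, upper) {
  p_upper <- plnorm(upper, meanlog, sdlog)   # CDF mass below the cap
  u <- runif(n, 0, p_upper)                  # uniforms restricted to that mass
  qlnorm(u, meanlog, sdlog)                  # map back through the quantile function
}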

I’m not sure what the solution is, but the diagnostic is similar to the previous one.

When we make predictions with this new model, even though the distributions of y and y_rep look similar (at least I’m assuming that on a log scale they look like the second plot here: Log target vs. non-linear model), sum(y_rep) produces values far away from sum(y). I’d make a plot of this to see what is happening.
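For example, something like this (assuming a brms fit called fit whose response is on the same scale as y – otherwise rescale by off_set first):

library(bayesplot)

yrep <- posterior_predict(fit)                  # draws x N matrix of y_rep
ppc_stat(y = df$y, yrep = yrep, stat = "sum")   # where sum(y) sits among the sum(y_rep) draws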

I’m not sure about a solution though. Maybe it’s that the individual outputs look well fit but the sum is off? It could be that the individuals aren’t fit as well as we think, or that the outputs aren’t IID, or something else – I’m not sure. It’s definitely fair game for a new post (that way you might attract more eyes to the problem, and it’s a different question from the original one here).


I might have similarly structured data and would be happy to share my experience, but I’m not sure how similar it is. Can you say more about what y and off_set mean? How variable is off_set, and why are you interested in the sum over N data points?


Sure! y and off_set are sales and population by geographic region, respectively. Both are right-skewed, and the ratio between them has been treated as an appropriate target variable (though ultimately sales predictions are what we’re after).

And the aggregation we want is by group, i.e. the prediction distribution for a number of categories (e.g. shop or city).
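The rough shape of that aggregation, assuming a grouping column like df$city (just an illustrative name) and fit being the fitted brms model:

yrep <- posterior_predict(fit)                              # draws x N
group_sums <- sapply(split(seq_len(ncol(yrep)), df$city),
                     function(idx) rowSums(yrep[, idx, drop = FALSE]))
apply(group_sums, 2, quantile, probs = c(0.05, 0.5, 0.95))  # per-group summaries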

Thnx