Model misspecified or only weakly predictive?

Hello,

I have been trying to use brms for parameter estimation with the following model:

anim.diff$full <- brm(formula = y ~ param1*param2*param3*param4 +
                      (1 | sub),
                    data = dat, 
                    family = student(),
                    warmup = 1000, iter = 5000, 
                    cores = parallel::detectCores(),
                    chains = 4, control = list(adapt_delta = .99), 
                    prior = prior, 
                    save_pars = save_pars(all = TRUE))

Where I get a posterior for some of the parameters with a high mass above/below zero, e.g., P(param < 0) = 0.96. However, when I then use loo to compare to a model without the respective parameter, the model without it is always the winning model. In fact, an intercept-only model seems to perform better than a model with any combination of the four predictors, while the bayes_R2 of the full model including all predictors is quite a lot larger than the bayes_R2 of the intercept-only model (R2 = 0.25 vs R2 = 0.04).
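
For reference, a probability like this can be computed in brms with, e.g., the hypothesis() function (shown here only as an illustrative sketch, not necessarily how it was computed for this model):

# Sketch: posterior probability that a coefficient is negative;
# the Post.Prob column of the output gives P(param1 < 0 | data)
hypothesis(anim.diff$full, "param1 < 0")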

I'm now trying to find out whether this is because my model is somehow misspecified, or whether it really means my model does in fact have very low out-of-sample predictive power.

loo shows that the effective number of parameters (p_loo) is very high (suggesting bad misspecification), while the Pareto k values are ok:

Computed from 16000 by 107 log-likelihood matrix

         Estimate   SE
elpd_loo     21.7  7.8
p_loo        19.8  2.2
looic       -43.5 15.6
------
Monte Carlo SE of elpd_loo is 0.1.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     103   96.3%   3020      
 (0.5, 0.7]   (ok)         4    3.7%   3048      
   (0.7, 1]   (bad)        0    0.0%   <NA>      
   (1, Inf)   (very bad)   0    0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.

But I can't seem to figure out why/how my model is misspecified.

This is the fit of my posterior predictive distribution:

as well as mean recovery:

Plotting loo_pit shows some deviation from a uniform distribution:

N.B. my outcome variable ranges from -1 to 1, where only certain values are possible (calculated based on the change in the number of correct responses out of 8 trials).

I would be very grateful if anyone could offer some insight on how I should interpret these results & diagnostics. Thanks!

Can you show the loo result also for the intercept-only model?

Do you really need all the interactions?

How many observations per sub?

As there are so few different outcomes, how about using the actual difference in the number of correct responses and an ordinal model?

Or why not predict the number of correct responses in the second case, using the first case as a predictor, and then you can use a binomial or ordinal model?
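
In brms, that suggestion could look roughly like the following sketch (the column names n_correct_drug and n_correct_placebo are placeholders for the raw counts, which are not shown in this thread):

# Sketch: model the number of correct responses (out of 8 trials) under drug,
# using the placebo count as a predictor, instead of modelling the difference score.
# n_correct_drug / n_correct_placebo are hypothetical column names.
fit_binom <- brm(
  n_correct_drug | trials(8) ~ n_correct_placebo + param4 + (1 | sub),
  data   = dat,
  family = binomial()
)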

You may get strange results from using Student's t for discrete data with a compact range.

Thanks a lot for your response.

Can you show the loo result also for the intercept-only model?

Sure.

Computed from 16000 by 107 log-likelihood matrix

         Estimate   SE
elpd_loo     27.9  6.9
p_loo         5.8  0.6
looic       -55.9 13.9
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

Do you really need all the interactions?

Theoretically they make sense. I could also drop some of the interactions, but this won't change the fact that even a model with just one of those parameters, while showing a meaningful posterior, doesn't outperform the intercept-only model.

How many observations per sub?

4 in total, 2 per stimulus condition which is modelled by one of the parameters.

As there are so few different outcomes, how about using the actual difference in the number of correct responses and an ordinal model?

Or why not predict the number of correct responses in the second case, using the first case as a predictor, and then you can use a binomial or ordinal model?

To clarify, I've calculated the change data to model a treatment effect (drug - placebo). Unfortunately I cannot take the differences between individual trials, because participants saw different stimuli in the drug and placebo conditions, which are not directly comparable. I'm not sure whether this gets at what you proposed - please let me know if not.

I initially tried modelling the data with a cumulative (ordinal) model, as this seemed more apt to me given the discreteness of my data - however, I then realised a Gaussian model outperforms the cumulative one by a large margin.

And I only tried the Student's t to get at the misspecification issues, and because I do have slightly heavier tails in my data - and indeed this did outperform the Gaussian model in the loo comparison.

Thanks!

How did you do the comparison? CV-FAQ says

  • You can't compare densities and probabilities directly. Thus you can't compare models given continuous and discrete observation models, unless you compute probabilities in intervals from the continuous model (also known as discretising continuous model).

Given what you told about the range and discreteness, comparing loo outputs directly will strongly favor continuous models, and you need to do the discretization properly in order to make a fair comparison to a discrete observation model.

Also, even assuming that a continuous model would be a good approximation here, it would be good to use left and right truncation, as currently the log density at the observations depends also on how much posterior mass lies in the tails beyond the known range of the data.
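
In brms, truncation is specified in the response part of the formula; a minimal sketch (fit_trunc is a placeholder name, and the rest of the call is assumed to stay as in the original model):

# Sketch: Student's t likelihood truncated to the known range of the outcome
fit_trunc <- brm(
  formula = y | trunc(lb = -1, ub = 1) ~ param1*param2*param3*param4 + (1 | sub),
  data    = dat,
  family  = student()
)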

Thanks again for your reply!

How did you do the comparison? CV-FAQ says

  • You can't compare densities and probabilities directly. Thus you can't compare models given continuous and discrete observation models, unless you compute probabilities in intervals from the continuous model (also known as discretising continuous model).

I see, I have read most of this CV-FAQ but actually missed this important detail… so thank you for pointing this out. I guess I would have expected some sort of error to show in the loo function in that case.

Also, even assuming that a continuous model would be a good approximation here, it would be good to use left and right truncation, as currently the log density at the observations depends also on how much posterior mass lies in the tails beyond the known range of the data.

I have added left and right truncation (lb = -1, ub = 1) to my model, but I can see no difference in terms of posteriors or loo metrics:

> loo(anim.diff$full)

Computed from 16000 by 107 log-likelihood matrix

         Estimate   SE
elpd_loo     22.7  7.6
p_loo        15.4  1.9
looic       -45.4 15.3
------
Monte Carlo SE of elpd_loo is 0.1.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     106   99.1%   2874      
 (0.5, 0.7]   (ok)         1    0.9%   4244      
   (0.7, 1]   (bad)        0    0.0%   <NA>      
   (1, Inf)   (very bad)   0    0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.
> loo(anim.diff$full.trunc)

Computed from 16000 by 107 log-likelihood matrix

         Estimate   SE
elpd_loo     22.7  7.6
p_loo        15.5  1.9
looic       -45.4 15.3
------
Monte Carlo SE of elpd_loo is 0.1.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     100   93.5%   2963      
 (0.5, 0.7]   (ok)         7    6.5%   3448      
   (0.7, 1]   (bad)        0    0.0%   <NA>      
   (1, Inf)   (very bad)   0    0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.

loo_compare(loo(anim.diff$full), loo(anim.diff$full.trunc))

                     elpd_diff se_diff
anim.diff$full.trunc 0.0       0.0    
anim.diff$full       0.0       0.1

So I guess my main question remains: to what extent can I interpret the posteriors if the loo evaluation suggests model misspecification (regardless of whether I'm using continuous or discrete models)? Or, in other words, would it be sensible to dismiss the CV results if my main goal is just parameter estimation, and given that the model fit seems reasonable (bayes_R2 of 0.25)?

The comparison can be valid in certain special cases, and thus there is no error, only a warning, but that warning can only be given if there is information on the observation model or data. The warning can be given, e.g., if you use the add_criterion() function in brms and compare brms model objects. But if you compare just loo objects, the necessary information needed for the warning may have been lost.

I would continue with computing the correctly discretized predictive log probabilities for the continuous model, as that would provide you with useful additional information.

I'm sorry, I'm not quite sure what you mean by 'computing the correctly discretized predictive log probabilities for the continuous model' - would you mind elaborating or pointing me to somewhere that explains this?

I've googled 'discretising continuous models', but all I find is info on how to discretise continuous variables.

CV-FAQ mentions

compute probabilities in intervals from the continuous model (also known as discretising continuous model).

and I agree that it's quite a compact explanation. A student is in the process of making a vignette, but meanwhile here's a quick hand-drawn figure:


Instead of using the log of the density d directly, you can compute a probability by dividing the range into small intervals and multiplying the density by the width of the interval. If I understood correctly, in your case the intervals are equidistant. I don't see that you have said how many unique values there are, but if you had 11 distinct values between -1 and 1, that is
-1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0
then each interval has width 0.2, and you get approximate probabilities (making a piecewise-constant approximation of the continuous density) by multiplying the densities by 0.2, which corresponds to adding log(0.2) \approx -1.61 to each pointwise value and, if you have 107 observations, adding 107*log(0.2) \approx -172.2 to the total elpd_loo. If you tell me more details about your data, I can check that your computation is correct. If you use truncation, you should truncate from -1-(b-a)/2 to +1+(b-a)/2 (where b-a is the interval width) to keep the edge discretization bins the same size as the other bins.
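
As a small R sketch of that arithmetic (just reproducing the numbers in the example above; K is the assumed number of distinct outcome values):

# Bin width for K equally spaced values on [-1, 1]
K        <- 11            # e.g. 11 distinct values
binwidth <- 2 / (K - 1)   # 0.2 in this example
N        <- 107           # number of observations

log(binwidth)             # approx -1.61, added to each pointwise elpd value
N * log(binwidth)         # approx -172.2, added to the total elpd_loo

# Truncation bounds that keep the edge bins the same width as the interior bins
c(-1 - binwidth / 2, 1 + binwidth / 2)   # -1.1 and 1.1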

Thanks very much for that explanation - I'm just still a bit unclear on which densities I should multiply by the bin width. Do I use an arbitrary number of draws from the posterior predictive distribution (e.g., using posterior_epred), then take the mean of that and multiply this by the 0.2?

Use the log densities given by the loo() function (e.g., the elpd_loo and pointwise[,'elpd_loo'] values), and do the computation on the log scale as I mentioned above. Assuming the variable binwidth contains the bin width and N is the number of observations, you can compute

loo1 <- loo(fit1)
# shift the total elpd_loo estimate by N * log(binwidth)
loo1$estimates['elpd_loo','Estimate'] <- loo1$estimates['elpd_loo','Estimate'] + N*log(binwidth)
# shift each pointwise elpd_loo value by log(binwidth)
loo1$pointwise[,'elpd_loo'] <- loo1$pointwise[,'elpd_loo'] + log(binwidth)

(This should work, but as I don't have your data and models, I'm not able to be 100% certain)

Ah, I see. I've done that now and, as I understand it, after applying the discretization, loo still suggests the continuous model is the better one:

Using a bin width of 0.125 (from 17 possible values between -1 and 1), I get the following for the discretized continuous truncated model:

anim.diff$full.trunc <- brm(formula = perc_accurChange | trunc(lb = -1.0625, ub = 1.0625) ~ param1*param2*param3*param4,
                    data = dat, 
                    family = student(),
                    warmup = 1000, iter = 5000, 
                    cores = parallel::detectCores(),
                    chains = 4, control = list(adapt_delta = .99), 
                    prior = prior3, 
                    save_pars = save_pars(all = TRUE))

loo1 <- loo(anim.diff$full.trunc)
loo1$estimates['elpd_loo','Estimate'] <- loo1$estimates['elpd_loo','Estimate'] + 107*log(0.125)
loo1$pointwise[,'elpd_loo'] <- loo1$pointwise[,'elpd_loo'] + log(0.125)

> loo1

Computed from 16000 by 107 log-likelihood matrix

         Estimate   SE
elpd_loo   -199.8  7.6
p_loo        15.4  1.9
looic       -45.4 15.2
------
Monte Carlo SE of elpd_loo is 0.1.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     103   96.3%   3524      
 (0.5, 0.7]   (ok)         4    3.7%   1731      
   (0.7, 1]   (bad)        0    0.0%   <NA>      
   (1, Inf)   (very bad)   0    0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.

And for the cumulative model:

anim.diff$full.ord <- brm(formula = perc_accurChange.ord ~ param1*param2*param3*param4  +
                      (1 | sub),
                    data = dat, 
                    family = cumulative,
                    warmup = 1000, iter = 5000, 
                    cores = parallel::detectCores(),
                    chains = 4, control = list(adapt_delta = .99), 
                    prior = prior2, 
                    save_pars = save_pars(all = TRUE))

> loo(anim.diff$full.ord)

Computed from 16000 by 107 log-likelihood matrix

         Estimate   SE
elpd_loo   -213.5  7.2
p_loo        14.0  1.3
looic       427.0 14.4
------
Monte Carlo SE of elpd_loo is 0.0.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     106   99.1%   8243      
 (0.5, 0.7]   (ok)         1    0.9%   9259      
   (0.7, 1]   (bad)        0    0.0%   <NA>      
   (1, Inf)   (very bad)   0    0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.

and comparing the two models:

> loo_compare(loo1, loo(anim.diff$full.ord))
                     elpd_diff se_diff
anim.diff$full.trunc   0.0       0.0  
anim.diff$full.ord   -13.7       5.4  
Warning message:
Not all models have the same y variable. ('yhash' attributes do not match) 

Very happy to provide more details about my data if that helps give further insight into the problems at hand: my aim is to estimate the relationship between the effects of a certain drug on accuracy in my main task and its effects on some auxiliary cognitive tasks. Ultimately I'm interested in gauging which cognitive processes were affected by the drug by evaluating the relationship between drug effects across different cognitive domains.

For this, I calculated 'accuracy' as the percentage of correct trials out of 8 in total. My final outcome variable is then calculated as the change in accuracy as a result of the drug treatment, i.e., by subtracting accuracy values obtained under drug from those obtained under placebo. Unfortunately I cannot use individual-trial change data, for the reasons I've tried to explain in a post further up.
Three of my predictors are drug effects on the above-mentioned auxiliary tasks, all continuous and z-standardised. The fourth predictor codes for 2 different levels of a stimulus condition (where I might expect relationships between drug effects for one, but not the other, stimulus condition).

Please let me know if there is any other information that could be useful to better understand my data & models.

Dropping one of the predictors and some of the interactions, and using the truncation, I get a more appropriate-looking p_loo - but model comparison still 'favors' the intercept-only model. Though I realised that the elpd_diff is smaller than the uncertainty (se_diff) associated with it - do I take from this that I can't trust this comparison?

anim.diff$m4 <- brm(formula = perc_accurChange | trunc(lb = -1.0625, ub = 1.0625) ~ param1*param4 + param2*param4,
                    data = dat, 
                    family = student(),
                    warmup = 1000, iter = 5000, 
                    cores = parallel::detectCores(),
                    chains = 4, control = list(adapt_delta = .99), 
                    prior = prior3, 
                    save_pars = save_pars(all = TRUE))

> LOO(anim.diff$m0.subs, anim.diff$m4)
Output of model 'anim.diff$m0.subs':

Computed from 16000 by 107 log-likelihood matrix

         Estimate   SE
elpd_loo     28.8  6.9
p_loo         1.9  0.2
looic       -57.6 13.9
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

Output of model 'anim.diff$m4':

Computed from 16000 by 107 log-likelihood matrix

         Estimate   SE
elpd_loo     27.7  7.1
p_loo         6.5  0.8
looic       -55.3 14.3
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

Model comparisons:
                  elpd_diff se_diff
anim.diff$m0.subs  0.0       0.0   
anim.diff$m4      -1.1       2.3 

Yes. However, now the magnitude of the difference is feasible (i.e., not very big), and I trust the result more.

Also, the magnitude of the difference is typically small when the data do not have enough information about the new parameters.

In your case, the number of observations is quite small, and the default prior for the coefficients is uniform, which favors overfitted models. It would be good to use tighter priors. It gets a bit more complicated to set good priors when you have all the interactions, but at least there are papers on priors for pairwise interactions. It would often be sensible to assume that the interaction effect is not bigger than either of the corresponding main effects. It's likely that brms doesn't yet have an easy way to use such a prior, so I would at least start with tight independent normal priors to check the sensitivity of the results, and if that helps, then consider specific priors for the interactions.
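
As a rough sketch of what such priors could look like in brms (the scales below are arbitrary placeholders, not recommendations, and the interaction coefficient name depends on the actual variables):

# Sketch: tight independent normal priors on all coefficients,
# plus an (optional) even tighter prior on one specific interaction term
prior_tight <- c(
  set_prior("normal(0, 0.5)", class = "b"),
  set_prior("normal(0, 0.25)", class = "b", coef = "param1:param4")
)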

Aki, you mentioned papers on priors for pairwise interactions, but a search turns up many such papers. Can you recommend a few key references, both for intuition and for practical guidance on setting priors on interactions?

In your case, the number of observations is quite small, and the default prior for the coefficients is uniform, which favors overfitted models. It would be good to use tighter priors.

Sorry, I should've clarified that I'm not using the brms default priors, but priors for the coefficients which I deem to be quite narrow, e.g., for the Student's t model:

prior3 <- c(set_prior("normal(0,0.1)", class = "b"),
            set_prior("normal(0,0.5)", class = "Intercept"))

However, I've not looked into specifying priors for interactions that differ from those on the main effects. So if you can recommend any of those papers discussing interaction priors, that would be great.

Great! Then you can also use priorsense to check prior/likelihood sensitivity (possible conflict or weak likelihood). See the priorsense package on GitHub (n-kall/priorsense), an R package for prior diagnostics and sensitivity analysis.
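
A minimal usage sketch, assuming a fitted brmsfit object called fit (the functions are from the priorsense package; exact arguments may differ between package versions):

library(priorsense)

# Table of power-scaling sensitivity diagnostics (prior and likelihood) per parameter
powerscale_sensitivity(fit)

# Density plots showing how the marginal posteriors move under power-scaling
powerscale_plot_dens(powerscale_sequence(fit))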

Oh, I've seen only a few. Happy to see a longer list. I've liked

If you find other useful papers, please post here

Thanks for this very useful hint. The sensitivity analysis for my main model did indicate prior-data conflict for my two interactions of interest, and these warnings disappeared after adjusting my coefficient prior SDs from 0.1 to 0.2. However, I can't see much difference in the plots (plot 1: coefficient priors = normal(0,0.1); plot 2: priors = normal(0,0.2)):


Am I correct in my understanding that what I want is a prior which is weakly informative but to which the posterior is not sensitive, whereas I do want sensitivity to the likelihood - as this indicates my data are indeed informative with regard to the model? So if I trust the priorsense analysis (and the absence of sensitivity warnings) after adjusting my priors, does this mean I can be more confident that my posteriors are indicative of actual effects?

Hi, nice that you tried out priorsense. Hopefully I can answer some of your questions:

Sometimes the density plots are not that clear; you can try using powerscale_plot_quantities, which can give a clearer idea of which quantities are changing (e.g., mean, sd, divergence) and by how much.
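
For example (a sketch, again assuming a fitted brmsfit called fit):

# Plot how summary quantities (e.g. mean, sd, distance measures) change
# as the prior and the likelihood are power-scaled
powerscale_plot_quantities(powerscale_sequence(fit))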

In this case there visually still appears to be some conflict, as the distributions move in opposite directions (weakening the likelihood moves them to the left, weakening the prior moves them to the right); perhaps the diagnostic values were just below the flagging threshold in the second case? Maybe a slightly wider prior could be used.

I think your understanding is correct. If your goal is to have weakly informative priors, then low prior sensitivity + high likelihood sensitivity would be a good target.

If you define 'actual effect' as 'effect indicated by the likelihood', then yes. Specifically, I would interpret low prior sensitivity + high likelihood sensitivity from priorsense as indicating that the prior is not very influential on the posterior, while the data / likelihood is. If your goal is to use weakly informative priors with the hope of basing the posterior mostly on the likelihood / data, then you can be more confident that this is the case.

Yes, it's great to know that such a package exists and provides a fast & no-fuss route to sensitivity analyses. After updating my brms installation to the latest version, implementing the package was super straightforward.

Yes, since I do not have any strong assumptions about the magnitude of my effects based on previous literature, I'd want the posterior to be mostly influenced by the likelihood.

Thanks a lot @avehtari and @n-kall for your help with this, I really appreciate being able to ask all sorts of modelling questions in this forum!

You would like to have an informative likelihood, but sometimes you just don't have that, and then you have to live with it. If the likelihood is weak, you know that the prior choice can matter a lot, and you need to justify the prior well. If there is a prior-likelihood conflict, it is possible that it's due to a quick choice of prior and additional thinking can give you a better, non-conflicting prior, but it is also possible that you really do have strong prior information and just by chance the data are such that there is a conflict. If the prior-likelihood sensitivity analysis indicates issues, it always means you need to think more and be more careful about the justifications for the priors and the model.

I see - thanks for this additional explanation. Unfortunately I don't have any data to base my priors on, so I do have to be careful. I have also generally found it quite difficult to confine priors to some (weakly) informative range when there is not a lot of prior information. That's why the sensitivity analysis tool is really super useful :)