How can I select a ROPE range for a logistic model based on odds ratios

Hello,

I am trying to understand how to specify ROPE when fitting a logistic model.
Let’s say I would like to detect whether the effect of a dummy-coded predictor (e.g., intervention) is larger than 1.20 or smaller than 0.80 on the odds ratio scale. I want to examine whether the intervention increases/decreases the outcome by 20% or more; if not, the intervention does not make a meaningful difference.

In this case,
ln(1.20) = 0.1823215568
ln(0.8) = -0.2231435513

So, I should set my ROPE as (-0.2231435513, 0.1823215568) on the log-odds scale.

Am I understanding this right? I feel this makes sense, but I have not encountered such a simple guideline anywhere, so I wonder if I might be misunderstanding something.

Edit (added later):

I found this post (ROPE range specification for binary models) and started to understand more clearly, but I still feel I am not 100% sure.

So, does the log(1.1) in the post refer to ln(1.1)?

Then, I guess my approach above is on the right track, but the interpretation should be:

I want to examine whether the intervention increases/decreases the ODDS of the outcome (i.e., the ratio of successes to failures) by 20% or more; if not, the intervention does not make a meaningful difference.

I may be struggling because English is not my first language and I am not strong at math. If somebody can confirm my understanding, I would greatly appreciate it.

Hey, @a_t. I think part of the difficulty here, which is part of the difficulty that often comes up with logistic regression, is that there are many ways to talk about the results. For example, at times you are talking about odds ratios, and other times you are referencing percentages. Let’s get specific. How exactly do you want to express your outcome? As an odds ratio? As a difference in probabilities? As one probability expressed as a percent change relative to another probability? Something else?

Me, for example, I always prefer probability contrasts, which is one probability minus another probability. I believe this is sometimes called a risk difference (though the jargon of “risk” is a poor fit for my discipline). Other folks, however, love those odds ratios.


Thank you for your response, @Solomon! I truly appreciate you and this community, as I am learning Bayesian modeling alone without having anybody near me to ask such questions.

My initial idea was to express it as an odds ratio.
When I stated

increases the odds of the outcome value by 20% or more

I intended to express the difference in odds (1.2 times or larger) compared to a control condition (e.g., no intervention).

I am using brms code something like the following:
fit <- brm(outcome ~ intervention + pretest_outcome + (1 | participant) + (intervention | item), family = bernoulli(), ...)
where intervention is dummy coded as 0 for control and 1 for intervention, and outcome is a correct/incorrect response on a cognitive task (e.g., math questions) after the treatment (intervention vs. no intervention). We administered the same test beforehand as a pretest, hence pretest_outcome in the model.

So maybe I should have stated it like this:
I want to examine whether the intervention increases or decreases the ODDS of a correct response (i.e., the ratio of correct to incorrect responses) by a factor of 1.2 or more compared to the control condition; if not, the intervention does not make a meaningful difference.

Then, setting the ROPE to be:
upper: ln(1.20) = 0.1823215568
lower: ln(0.80) = -0.2231435513
Am I understanding this right?
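
For concreteness, this is roughly how I imagine checking it in R, reusing the fit object from the brms call above (I am guessing the coefficient is stored as b_intervention in the posterior draws, and the bayestestR call is only my reading of its documentation; please correct me if this is off):

# a minimal sketch, assuming the model above; b_intervention is my guess
# at the name brms gives the dummy-coded intervention coefficient
library(brms)
library(posterior)

draws <- as_draws_df(fit)
b     <- draws$b_intervention          # posterior draws of the log odds ratio

rope_low  <- log(0.8)                  # -0.2231436
rope_high <- log(1.2)                  #  0.1823216

mean(b > rope_low & b < rope_high)     # share of the posterior inside the ROPE

# I believe bayestestR also accepts a custom range directly, something like:
# bayestestR::rope(fit, range = c(rope_low, rope_high))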


Me, for example, I always prefer probability contrasts, which is one probability minus another probability. I believe this is sometimes called a risk difference (though the jargon of “risk” is a poor fit for my discipline). Other folks, however, love those odds ratios.

Thank you for the suggestion. I am wondering if this corresponds to the comments made here (ROPE range specification for binary models)?

So, with this approach, we need to know the baseline probability (or odds) to start with, right? In my experiment, I do not know beforehand how well the control condition performs on the outcome. Even in such a case, can I still use probability contrasts?

I know it’s been a while, but if I could hear any ideas from @Solomon and/or anybody about my understanding and confusion of setting a ROPE range, it would be really appreciated! :)

I’d suggest following @andrewgelman’s advice and just report the posterior ranges rather than trying to reduce to a binary decision. If you have some downstream decision-making process, then you can compose the two with Bayesian inference without first reducing the posterior to a binary decision of significant/non-significant. That is, is there really that big of a difference between 19.9% and 20.1% that you want to treat them completely differently?

Usually people use log and ln interchangeably; in programming languages and in stats, log almost always means the natural log.

Thank you for your insight, @Bob_Carpenter !

I appreciate your confirming my understanding and setting of the ROPE range and suggesting that I avoid testing hypotheses dichotomously. I totally agree with you, and the reason I got so fascinated by Bayesian modeling in the first place was that I thought that I could interpret results in a more nuanced manner.

However, I am having difficulty dealing with the practical aspect of analysis in my research area to take full advantage of the posterior. If I may, I would like to ask a follow-up question about this.

Could you kindly let me know which reference (and the section name and/or page number, if possible) you would recommend I read on this? I have read Bayesian Data Analysis (Third edition) as well as other Bayesian books (Statistical Rethinking and Doing Bayesian Data Analysis), but I feel I have not yet settled on how I want to interpret the posterior distribution. However, I could be missing a lot.

In my research area, most studies still rely on the frequentist approach. So, even if Bayesian modeling is adopted, the “meaningfulness” of a predictor variable is judged simply based on whether or not the 95% Credible Interval includes 0.

Many people recommend choosing a credible interval range based on domain knowledge, but there is no practical, shared idea about the size of credible intervals in my area. Also, if we keep judging whether the credible intervals include zero or not, I feel that we cannot avoid the binary decision. This is probably because we need to (1) use a range that sounds familiar to the majority (e.g., 95%) and (2) judge the meaningfulness of predictors quickly against a pre-determined range (95%, 89%, or anything else, but decided beforehand, so that the ‘subjective’ judgment is made in advance), because there are usually many predictors to examine in a study.

Could you suggest how to deal with such a situation?

I may be missing something, but I feel that I have to have a pre-determined cut-off point to make a judgment. I could make a statement such as “the probability that a participant’s chance of providing an accurate response on the final test is above 80% is 80%”, based on the posterior… but this seems fairly complicated, as it is talking about the probability of a probability, so it does not seem very appealing to reviewers.

I could describe the posterior with multiple ranges (e.g., 5%, 10%, 50%, 90%, and 95% CrI), but I am unsure how to make a nuanced judgment without a predetermined cut-off range, such as whether 90% CrI excludes 0…

I have been reading many articles and books on Bayesian hypothesis testing and learning about the probability of direction, ROPE, and other tools (e.g., Reporting Guidelines • bayestestR), but I feel I am still having difficulty taking full advantage of Bayesian modeling.

Any suggestions and references that I can learn and read into would be very much appreciated!

Look, for example, at figure 5.4 in BDA3 (you can get it free online from the book’s home page). It just reports the 95% intervals. McElreath is just using 89% to get across the idea that the choice is arbitrary.

I’m not sure why you feel that a binary decision is inevitable. What’s the downstream task? Why report whether it’s non-zero at some interval width rather than just reporting the median and 90% interval?

What kind of judgement are you trying to make and what kind of nuance are you worried about? Let’s say we estimate a regression coefficient and its 90% interval is (0.1, 1.8) and its 99% interval is (-0.2, 2.9). Report them both. Show a histogram. But why try to reduce to a yes/no decision about significance? For a start, these coefficients only make sense relative to the other coefficients. You’ll get different results about whether the “income” covariate is “significant” depending on which other covariates are available, such as age, sex, education, and zip code.
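
Concretely, in R with your brms fit it might look something like this (just a sketch; I’m assuming the coefficient shows up as b_intervention in the draws):

# sketch: report two interval widths and plot the posterior,
# rather than reducing it to a significant/non-significant call
draws <- as.data.frame(fit)$b_intervention   # posterior draws of one coefficient

quantile(draws, probs = c(0.05, 0.95))       # 90% central interval
quantile(draws, probs = c(0.005, 0.995))     # 99% central interval

hist(draws, breaks = 50,
     main = "Posterior of the intervention coefficient (log odds)",
     xlab = "coefficient")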

I think you need to get to the bottom of why you feel you need to make a judgement about some parameter’s marginal posterior interval containing zero or not. If it’s because a journal editor or advisor wants a p-value, then I doubt anything you do with Bayes is going to make them happy.

The thing to read is McElreath’s book, Statistical Rethinking—it’s aimed at exactly someone like you asking exactly these kinds of questions. He also has online videos and a lot of other teaching material.


Thank you very much for your input, @Bob_Carpenter !
I truly appreciate you taking the time to educate someone like me who is trying to learn and use Bayesian modeling but is having difficulty discussing and asking for help.

I sincerely apologize for my delayed response. I’ve been reading the two books you mentioned, ‘BDA3’ and ‘Statistical Rethinking,’ trying to better understand your response. (Honestly, BDA3 was too advanced for my current knowledge level, although many sections seemed very helpful. I plan to revisit the book in the future to deepen my understanding. And I am still doing Statistical Rethinking.)

This question made me think more deeply about what puzzles me. My paper was once rejected by a reviewer who commented that the manuscript lacked p-values (among other comments). The Bayesian approach isn’t common in my field.

Also, studies typically include multiple predictors, such as ‘intervention status,’ ‘test timing,’ ‘testing formats,’ ‘age,’ ‘gender,’ ‘background,’ ‘context,’ and generally the interactions between these predictors. I’ve been unconsciously thinking, “It would be easier and more straightforward to say, ‘The intervention significantly enhanced posttest scores, and there was a significant interaction between intervention and gender. Meanwhile, no effects were observed for test timing and formats.’”

Given this multiple-predictor practice, I find it challenging to thoroughly discuss the posterior distribution within space constraints. I would appreciate learning any better strategies for handling such situations.

Thank you for your clear suggestion!

Following your suggestion, would my report on the estimated coefficient of ‘intervention’ look something like this:

The estimated Odds Ratio for ‘intervention’ was 1.20, 95% CrI [0.90, 1.50], 90% CrI [1.01, 1.41], 50% CrI [1.10, 1.30].

The estimated OR for ‘gender’ (reference = Male) was 0.9, 95% CrI [0.40, 2.5], 90% CrI [0.90, 1.80], 50% CrI [0.85, 0.95].
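
(In case it is useful, this is roughly how I would compute such numbers from the brms fit above; the numbers in the two sentences are made up for illustration, and b_intervention is my guess at the coefficient name in the draws. The same pattern would apply to the other coefficients.)

# sketch: odds-ratio-scale summaries at several interval widths
or_draws <- exp(as.data.frame(fit)$b_intervention)

median(or_draws)                              # point summary on the OR scale
quantile(or_draws, probs = c(0.025, 0.975))   # 95% CrI
quantile(or_draws, probs = c(0.05,  0.95))    # 90% CrI
quantile(or_draws, probs = c(0.25,  0.75))    # 50% CrI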

In this case, can I interpret these results as follows: the intervention increased the odds by a factor of 1.2 (compared to no intervention), and we are 90% confident that the effect is meaningfully positive because we are 90% confident that the OR exceeds 1.

Regarding gender, females scored 10% less accurately than males, and we are 50% confident that females’ mean test scores were lower than males. However, we are 90% confident that the OR can be positive, indicating that there is a good chance that males can be less accurate than females.

I’m not entirely confident about my wording or clarity, but am I on the right track?
How would you interpret the above results and describe your interpretation?

When studying frequentist statistics, I learned that the alpha level should be set before the study and shouldn’t be changed after obtaining results (hence the conventional p < 0.05). When evaluating predictors’ usefulness based on various credible intervals (e.g., 50%, 80%, 90%, 99%), I worry that I’m changing the judgment criteria, which feels incorrect. Given the interpretation of Bayesian credible intervals, would this approach be acceptable?

Additionally, in BDA3, posterior distributions often seem to be summarized with 50% and 95% credible intervals. For example, the caption of Figure 16.6 (“Anova display for two logistic regression models of the probability that a survey respondent prefers the Republican candidate for the 1988 U.S. presidential election, based on data from seven CBS News polls”) refers to “50% intervals, and 95% intervals of the finite-population standard deviations s_m.” However, the figure is then interpreted fairly casually rather than each variable being discussed individually. Is this something you would recommend? Or is this just a casual explanation for the sake of illustration in the book, and would you rather recommend something more detailed?

An approach that might be useful for you is to do posterior predictive simulation - draw samples from the posterior distribution of your outcome variable conditional on different predictors being set to certain levels. Then you can plot these to illustrate the predicted implications of various predictor values, or combinations of predictor values. Often this will give a much better understanding of what pattern has actually been found in the data than reporting lots of coefficients. As @Bob_Carpenter points out, their interpretation is model dependent anyway. Here’s an example:

Figure 2 | Informant Discrepancy in Report of Parent-adolescent Conflict as a Predictor of Hopelessness among Depressed Adolescents: A Replication Study | Journal of Child and Family Studies

This is from a study of differences in reporting of conflict between depressed adolescents and their parents as a predictor of hopelessness. As you can see, we’ve used colouring to show the parts of the distribution of the outcome variable falling within what’s conventionally considered a severe level of hopelessness (a perceptive viewer will see the symmetry in the plots - we did not estimate an interaction between conflict level and discrepancy, but the model fitted rather well across the range of conflict reported).
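
If you want to try this with the brms model from earlier in the thread, the mechanics might look roughly like the sketch below (the predictor values are made up, and group-level terms are set aside via re_formula = NA):

# sketch: posterior predictive simulation for chosen predictor settings;
# the predictor values here are invented for illustration
newdat <- expand.grid(intervention    = c(0, 1),
                      pretest_outcome = c(0, 1))

# expected probability of a correct response for each row of newdat,
# one column per row and one row per posterior draw; group-level terms dropped
pred <- posterior_epred(fit, newdata = newdat, re_formula = NA)

# plot the implied distribution of P(correct) under each condition
par(mfrow = c(2, 2))
for (i in seq_len(nrow(newdat))) {
  hist(pred[, i], breaks = 40, xlim = c(0, 1),
       main = paste0("intervention = ", newdat$intervention[i],
                     ", pretest = ", newdat$pretest_outcome[i]),
       xlab = "P(correct response)")
}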

Anyway - I think this is a much better way to communicate whether a predictor matters or not than looking at variants of statistical significance, and very intuitive to readers too.


+1 to that suggestion. But it gets hard to compare models that way—you wind up with a ton of noisy simulation spaghetti for all the variants and they’re hard to compare visually.

What you can do is posterior predictive checks. These are the Bayesian analogue of chi-squared goodness of fit tests for regressions.
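
With brms this can be as simple as something like the following (a sketch; the exact plot types available depend on your bayesplot version):

# sketch: graphical posterior predictive checks for a bernoulli outcome
pp_check(fit, type = "bars", ndraws = 100)    # observed vs. replicated 0/1 counts
pp_check(fit, type = "stat", stat = "mean")   # overall proportion correct vs. replications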

That happens. There’s not really a Bayesian solution to this problem, though. You just need more open-minded reviewers. @mitzimorris was once asked to compute p-values for an application of latent Dirichlet allocation for a Science paper, which is not only pointless for exploratory data analysis, but also hopelessly intractable.

These are all frequentist statements about hypothesis tests on regression coefficients.

You’re not going to fit a Bayesian model and get zero effect on real data, because posteriors don’t collapse to points. What you might get is an effect that’s deemed to be not significant at some level (e.g., 0.05) according to a frequentist hypothesis test.

If this is what you want to do, you’re looking at the wrong software. I’d suggest looking at something like lme4 in R as an alternative that will let you do this.

Of course. If you have 5 predictors, then there are 32 models to consider even without interactions. If you include interactions as well as self interactions (to make a quadratic transform), then the combinatorics get totally out of hand at 2^(5 + (5 choose 2)) = 32K. You’ll run out of space considering every model. So instead people do something greedy that’s not guaranteed to get the optimal set of predictors, for example, evaluating them for significance one at a time. Or you can use shrinkage like the lasso, but you have the same problem you have with Bayes trying to explain what happens with different penalty weights.

That’s fine, but you can also plot. Plus, you’re going to need to say what the estimation procedure was. We usually take posterior means, as they minimize expected squared error, but you can also use medians, which minimize expected absolute error.

I wouldn’t use the word “confident” as it has a technical meaning in stats around confidence intervals. You can say that conditioned on the data and model, we estimate a 90% probability that the OR is greater than 1.

I’d keep going with McElreath’s book as it’s all about this kind of thing.

You’re just giving a better picture of what the actual posterior looks like. None of these intervals are criteria; they’re just telling you about posterior uncertainty in parameters given observed data (relative to the model, of course).

I’m afraid I didn’t understand. You can’t interpret coefficients independently. The meaning of a regression coefficient is relative to keeping all the other regression coefficients as they are. If you change the other covariates (predictors), then you’ll get a different estimated coefficient. For instance, if I include income and fit a coefficient, then include both income and savings, the coefficient for income changes. So there’s no saying things like “income is significant” in an absolute sense—it’s always relative to the rest of the regression.
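
A quick simulated example of what I mean (made-up data, nothing to do with your model):

# sketch: the income coefficient changes when a correlated predictor is added
set.seed(1)
n       <- 1000
income  <- rnorm(n)
savings <- 0.7 * income + rnorm(n, sd = 0.5)          # correlated with income
y       <- 1 * income + 2 * savings + rnorm(n)

coef(lm(y ~ income))             # income soaks up part of the savings effect
coef(lm(y ~ income + savings))   # income coefficient changes once savings is added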

If you want something long, Gelman et al.'s Regression and Other Stories goes over a lot of this. I think the book’s available for free as a pdf, but I’m not sure.


Thank you very much, @erognli and @Bob_Carpenter, for your reply to my question!

Dear @erognli ,
I appreciate you letting me know about a paper showing that coefficients can also be interpreted with a posterior predictive simulation. I will learn more about this and try using it when focusing on one (or a few) parameter(s) and analyzing them in detail.

I also liked that the paper used 66% and 90% Credible Intervals to describe posteriors while using the terms “likely” and “very likely.” I think such a way of communicating the posteriors is clear, straightforward, and in line with what I am learning from @Bob_Carpenter and other readings.

Dear @Bob_Carpenter ,
Thank you for your detailed response to my question. I feel I understand your suggestions and Bayesian interpretation much better now.
It is very scary to hear that even established researchers face difficulties communicating the Bayesian approach to reviewers (even with Science!!).

I assume you may be referring to Bayesian model stacking or something similar. I have only heard of its name, but I feel that the dots started connecting now. I will keep learning about these.

Thank you for checking my understanding and the language I am using to describe posteriors! This is one of the many things I have really wanted to consult someone about for a long time. As English is my second language, receiving feedback on my descriptions from an expert is truly helpful! I also appreciate you reminding me of the importance of providing plots of the posteriors and of remembering that parameter estimates and their CrIs are conditional on the data at hand and the structure of the fitted model.

After receiving your replies, I started searching for papers to see how they interpret and report Bayesian estimations.
Alongside the paper @erognli suggested for me, I found that a decent number of people cite the UN IPCC guidance (https://www.ipcc.ch/site/assets/uploads/2017/08/AR5_Uncertainty_Guidance_Note.pdf)

Table 1. Likelihood Scale (adapted from p. 3)

Term: Likelihood of the Outcome
Virtually certain: 99-100% probability
Very likely: 90-100% probability
Likely: 66-100% probability
About as likely as not: 33-66% probability
Unlikely: 0-33% probability
Very unlikely: 0-10% probability
Exceptionally unlikely: 0-1% probability

(see also https://www.ipcc.ch/site/assets/uploads/2018/02/WG1AR5_SPM_FINAL.pdf)

@Bob_Carpenter , do you think making an evaluation based on such categories would be reasonable for posteriors from a Bayesian estimation?

I wonder if I can evaluate the credible intervals in the following manner? It seems fairly straightforward (though I am highly aware that it risks turning into automatic, no-thought judgments if used thoughtlessly).

Whether the CrI of a treatment’s odds ratio includes 1 or not:
(* given the data and fitted model, while the effects of other predictors are held constant)

66% CrI does NOT include 1 (but the 90% CrI does), and the OR is above 1:
The treatment likely increases the odds of an accurate response (but with 90% probability the odds ratio ranges from, say, 0.5 to 1.5, suggesting substantial uncertainty in the relationship between treatment and response accuracy).

90% CrI does NOT include 1, and the OR is above 1:
The treatment very likely increases the odds of an accurate response. “We estimate a 90% probability that the OR is greater than 1.”

66% CrI includes 1:
There is great uncertainty about whether the treatment increases or decreases the accuracy of response.

Your approach seems sensible enough to me, given your purpose as far as I understand it, but others may disagree.

I think it’s useful here to be careful about the difference between how to understand the implications of the posterior distribution, useful ways of communicating about the posterior distribution, and decision rules for claiming scientific success or truth.

I find it useful to remind myself that the posterior distribution is the answer to the question I have posed by collecting the data and fitting the model (given that computation hasn’t failed me). If it’s not clear to me what that answer means, I usually find that I’ve not been thinking clearly and carefully enough about what question I am actually asking.

When you know both your question and understand the answer, the remaining task is communicating both to others. I think your approach can be a useful way to communicate, but along with transparency about both the model and the posterior, so others can make up their own mind about what the answer means.


Not quite. I’m referring to the traditional frequentist way of selecting predictors. That is, you run a regression and look at which parameters are estimated as significantly different from zero.

No. I would steer clear of these tables and the descriptive terms for them. The numbers are the numbers, but the words attached to them and the cutoffs are too vague to be useful. For example, I wouldn’t describe 2:1 odds as “about as likely as not” because at 66% it’s twice as likely as not. Ditto for the whole Bayes factor thing.

Or as @erognli put it so well,

Usually, we reduce that to summaries using expectations. I’d just stick to doing what Andrew Gelman and @erognli are recommending:


Dear @erognli and @Bob_Carpenter ,
Thank you very much for educating me on this.

Now, I feel I have a clearer view of how to interpret and report posteriors.
So, instead of just dichotomously categorizing the results as “significant” or “not significant,” we can look at the posterior to see where the point estimate is most likely to be located and how much uncertainty we have about the estimate. Visual plots help us communicate this directly. Credible intervals are better interpreted in terms of the probability they carry than by focusing on an arbitrary cut-off point predetermined by somebody.

I will keep learning about these and hope to publish research using Bayesian modeling, ideally soon!!
Thank you again, and I hope you two have a great day today.


That’s not quite right. The posterior isn’t about point estimates—it represents the complete uncertainty in the parameters given the data. For example, it gives you all the quantiles of uncertainty, e.g., the parameter is 95% probable to be less than a given value.

There are three traditional approaches to point estimation for Bayesian models, only two of which involve Bayesian inference (1 and 2):

  1. posterior mean—this minimizes expected squared error in estimates (conditional on the model being correct)
  2. posterior median–this minimizes expected absolute error in estimates (also conditioned on model correctness)
  3. posterior mode (aka max a posteriori, aka MAP)—this doesn’t have an error-based interpretation and is not a Bayesian procedure

All 3 will converge to the true value as data increases in well-behaved parametric models (i.e., ones where the parameters don’t grow along with the data).

Standard error is the measure of uncertainty in estimation. This goes down to zero as the sample size goes to infinity. The posterior standard deviation, on the other hand, will not shrink with more samples, but it will be more accurately estimated. The standard deviation is your uncertainty in the parameter; the standard error is the uncertainty in the posterior mean estimate.
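
You can see the two side by side with the posterior package (a sketch; I believe summarise_draws accepts these summary measures by name):

# sketch: posterior sd (uncertainty about the parameter) vs. mcse_mean
# (uncertainty in the Monte Carlo estimate of the posterior mean)
library(posterior)

draws <- as_draws_df(fit)
summarise_draws(draws, "mean", "sd", "mcse_mean", "ess_bulk")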

Thank you for following up to double-check my understanding, @Bob_Carpenter . As I would like to communicate my Bayesian interpretation accurately and clearly, I really appreciate it.

Reading your explanation of this part, I thought this could have been one important piece I had missed. I thought the posterior standard deviation goes down to zero as the sample size goes to infinity in the same (or similar) manner as SE in frequentist statistics.

For example, I have learned that the probability of direction is strongly correlated with p-values (Probability of Direction (pd) • bayestestR). If this is the case, won’t the posterior get narrower and narrower as the sample size increases?

I also have difficulty understanding what exactly “the posterior standard deviation will be more accurately estimated” means. If the posterior SD is about uncertainty in the parameter, shouldn’t uncertainty decrease as the sample gets closer to covering the target population? Would there be something like a true, accurate uncertainty?

I can understand that if a parameter is about predictions generated from a model, then there would be an “accurate uncertainty (or variance)” to the estimation. However, when a parameter is about the mean of a specific group’s test scores (which is usually the case), there shouldn’t be a variance (i.e., only a true parameter value and its posterior probability), right?

One confusing thing here is mixing up the uncertainty in the MCMC approximation of the posterior distribution and the uncertainty about parameters that the posterior distribution quantifies. In both cases uncertainty decreases with larger samples, but with more MCMC samples and more data, respectively.

So you can have a regression model with a large sample, and run very short MCMC chains to get an uncertain estimate of a posterior distribution with a small SD. Or you could have a tiny sample and run MCMC for ages to estimate very precisely a posterior distribution with a huge SD.

Another issue specific to a regression model is of course the error term, which will remain as it is irrespective of both sample sizes as long as there are sources of variation that aren’t included in your model.

I guess it differs by application and field what the main challenge for inference is; for my work it’s usually getting enough observations, but of course there are cases and fields where computation is the limiting factor.


@Bob_Carpenter @a_t @erognli I think there is a huge disconnect between Bayesian theorists who write and develop theories and what we as researchers really need and look for. We want to use Bayesian methods because they are much more flexible and allow for better use of uncertainty and priors (and all the advantages that do not need to be enumerated here). However, in the end it all comes down to a binary decision: is the treatment effective? (A regulator will only have two options based on our results: either approve the drug for the market or reject it, based on the estimated efficacy.) Does the drug cause heart damage? (This comes down to a yes or no, regarding approval.) And this is what we want to find out using Bayesian techniques. Reporting a CrI or any other metric ultimately has to lead to the inevitable final decision, and the goal is to make this as accurate as possible.

For me, one of the things we would most need to develop in software is support for loss functions (because rejecting an effective drug is not as bad as approving an unsafe one). By mapping the value of each possibility to a utility function, we could find out whether considering the treatment (or your outcome) effective gives a positive or negative expected utility, and we would proceed if it is positive, since we seek to maximize expected utility.

I don’t know that I buy how you’ve framed the issue. Though I can appreciate that decision makers need to make decisions, it’s not clear to me that it’s the job of a model to do that for them, and in my capacity as a substantive researcher, this is not what I’m generally looking for either. When I work on RCTs, effect sizes with 95% intervals suit my colleagues and me just fine. However, since you bring up utility functions, you might like the texts by Berger (Statistical Decision Theory and Bayesian Analysis | SpringerLink) or Robert (https://www.harvard.com/book/9780387715988).

I don’t consider myself a theorist!

Sometimes that’s true. But in my experience working with practitioners, that’s rarely been the only goal. Usually we also want to know how effective a treatment is for different subsets of the population. We want to learn appropriate dosing for children and adults. We want to measure potential side effects. There are all kinds of goals when studying a new medical treatment.

And there are all sorts of applications of Bayes that aren’t medical.

My understanding is that regulators (and doctors and patients) are willing to accept some potentially serious side effects for drugs if their main effects justify it. For instance, we know chemotherapy causes damage, but we do it anyway, because treating the cancer is worth it. We prescribe blood pressure meds even though we know they have side effects on vitamin levels. We prescribe blood thinners to prevent strokes even though they seriously increase the brain damage from falls, and many of the elderly patients on blood thinners are high fall risks (I know this from my father’s case—he died from complications of a fall when on heavy blood thinners to prevent stroke from arrhythmia).

Like @Solomon, that has not been my experience. At least in the United States, Phase I clinical trials are very different than Phase II and Phase III. In Phases II and III the techniques are very constrained, whereas in Phase I the researchers are trying to learn as much as they can about a drug, not just make an “effective/not-effective” binary decision. For example, that was how we worked with Novartis on a Phase I clinical trial in the very first industrial project we got involved with for Stan.

What’s CrI?

This is exactly what Bayesian decision theory does—we define a utility function for different events or parameter values. In Stan, you can code that up in generated quantities as a posterior predictive inference.
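
Or, staying in R with a brms fit like the one earlier in this thread, the pattern is roughly: define a utility over outcomes, push each posterior draw through it, and average. A sketch (the utility numbers and predictor settings are completely made up):

# sketch: expected utility of adopting the intervention, averaged over the posterior;
# the utilities and predictor values are invented for illustration
newdat <- data.frame(intervention = c(0, 1), pretest_outcome = 0)
p      <- posterior_epred(fit, newdata = newdat, re_formula = NA)

# made-up utility: benefit proportional to the gain in success probability,
# minus a fixed cost of rolling out the intervention
u_adopt <- 10 * (p[, 2] - p[, 1]) - 1

mean(u_adopt)        # expected utility of adopting
mean(u_adopt > 0)    # posterior probability that adopting has positive utility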

Maximizing expected utility is tricky in that you need to build things into your utility function to deal with risk. By itself, flipping a coin for double or nothing on your life savings plus $1 has a positive expected monetary value ($1), but we usually won’t do it because of the risk.