Regression model question

I have a regression model, but I get very different answers depending on how I generate part of the input, and I’m asking for your insight as to why:

\begin{align} y &\sim \text{BernoulliLogit}(\alpha + \beta x) \end{align}

Now let’s say that I generate x as follows, using the CDF of gamma:

\begin{align} v &\sim \text{Gamma}(16,0.2) \\ x &= \text{GammaCDF}(v|16,02) \\ y &\sim \text{BernoulliLogit}(1+3 x) \end{align}

the model works fine and I get what I would expect, for example with 1000 observations:

 variable    mean  median   sd  mad      q5     q95 rhat ess_bulk ess_tail
    lp__  -304.63 -304.31 1.02 0.72 -306.75 -303.67 1.00     1316     1750
    alpha    0.83    0.83 0.18 0.18    0.54    1.13 1.00     1330     1454
    beta     3.31    3.30 0.46 0.45    2.57    4.10 1.00     1236     1143

HOWEVER, if I use the PDF instead of the CDF of Gamma:

\begin{align} v &\sim \text{Gamma}(16,0.2) \\ x &= \text{GammaPDF}(v|16,02) \\ y &\sim \text{BernoulliLogit}(1+3 x) \end{align}

Now I get something where the uncertainty around \beta is huge and does not shrink with increasing sample size:

> fit$print()
 variable    mean  median    sd   mad      q5     q95 rhat ess_bulk ess_tail
    lp__  -593.00 -592.72  0.97  0.70 -594.98 -592.08 1.00     1494     1861
    alpha    0.46    0.46  0.18  0.18    0.15    0.75 1.00      689     1034
    beta    33.25   32.89 12.11 12.27   13.96   53.46 1.00      696      877

Why is this?


Stan code:

data {
  int<lower=0> N;
  vector[N] x;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  real alpha;
  real beta;
}
model {
  y ~ bernoulli_logit(alpha + beta * x);
}

That’s because the cdf value is basically summing all the density from 0 to x, so its values are bigger. To compensate for the lower values from the PDF, \beta must increase.
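To make that scale difference concrete, here’s a small sketch with SciPy (it assumes the rate is 0.2 as in the sampling statement, i.e. scale 5):

```python
import numpy as np
from scipy.stats import gamma

# Gamma(shape=16, rate=0.2); SciPy parameterizes by scale = 1/rate = 5
dist = gamma(a=16, scale=5.0)
v = dist.rvs(size=10_000, random_state=np.random.default_rng(1))

x_cdf = dist.cdf(v)  # probability integral transform: uniform on (0, 1)
x_pdf = dist.pdf(v)  # density values, bounded above by the density at the mode

print(x_cdf.min(), x_cdf.max())  # spans most of (0, 1)
print(x_pdf.max())               # roughly 0.02, so beta must scale up to compensate
```

With x two orders of magnitude smaller in the pdf case, a \beta around 30 in the pdf fit corresponds to a much smaller effect per unit of x than \beta = 3 does in the cdf fit.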


Thank you @spinkney, but wait: I generate fake data with known values. That is, I generate the data using the PDF (low values, obviously) with a known \beta, etc. Why isn’t the model recovering \beta properly for the low generated values of x when it does so for the higher values? It’s not even close.

What transformation would be appropriate?

It shouldn’t matter how you collect the x values; you should be able to fit with enough data. But binomial data is very weak, so I would double-check that the data is being generated in a way that matches the model.

I plotted the x values you get from simulating from your generative process. As you would expect, it’s uniform if you sample (rgamma) and then take the cdf (pgamma), and it’s not if you sample (rgamma) and then take the pdf (dgamma):
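For anyone following along in Python rather than R, the same comparison can be sketched with a Kolmogorov–Smirnov test against the uniform distribution (a minimal sketch, assuming the same Gamma(16, 0.2) setup):

```python
import numpy as np
from scipy.stats import gamma, kstest

dist = gamma(a=16, scale=1 / 0.2)  # Gamma(shape=16, rate=0.2)
v = dist.rvs(size=5_000, random_state=np.random.default_rng(0))

# The CDF of a draw from its own distribution is uniform (probability integral transform),
# so the KS statistic against uniform(0, 1) is small for the cdf values ...
print(kstest(dist.cdf(v), "uniform").statistic)
# ... and near 1 for the pdf values, which are all tiny
print(kstest(dist.pdf(v), "uniform").statistic)
```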

May I ask what the motivation is for using either a cdf or pdf and how it is that both might work? It looks like it’s just adding noise. The cdf will be monotonic, but the pdf won’t.

I would go back to the code to make sure the data simulation matches the model. Feel free to share it here.


I realize I used a 0.2 rather than 2 (the 02 notation in the post confused me). I thought the easiest thing to do would be to simulate to show you that it works as expected. Bottom line:

=== N = 256 ===

-- x mode: cdf --
alpha  90% interval: [-0.158, 1.110]
beta   90% interval: [3.294, 7.119]
beta interval width:  3.825

-- x mode: pdf --
alpha  90% interval: [0.923, 2.412]
beta   90% interval: [-96.184, -2.224]
beta interval width:  93.959


=== N = 1024 ===

-- x mode: cdf --
alpha  90% interval: [0.837, 1.426]
beta   90% interval: [2.146, 3.637]
beta interval width:  1.491

-- x mode: pdf --
alpha  90% interval: [0.566, 1.171]
beta   90% interval: [-2.650, 37.312]
beta interval width:  39.962

You can see that with 1024 data points, the posterior intervals do indeed shrink in both cases, as one would expect with a regression. By the time you get to N = 1024, the true value (3) shows up in both posterior 90% intervals.

Here’s the prompt I used to generate the code:

I would like you to simulate data and fit that data using cmdstanpy for two different models with 1000 simulated data points. 

Data generation 1: v ~ gamma(16, 0.2); x = gamma_cdf(v | 16, 0.2) 

Data generation 2: v ~ gamma(16, 0.2); x = gamma_pdf(v | 16, 0.2) 

So in one case, I simulate and take the cdf and in another, I simulate and take the pdf. In case 1, it's just generating uniform(0, 1) variates. 

Then after generating N=256 x values in both of these situations, I want you to generate y values for both according to: y ~ bernoulli_logit(alpha + beta * x) where alpha = 1 and beta = 3. 

We will now have (x1, y1) and (x2, y2). I want you to fit both of these models with a Stan model with improper uniform priors and likelihood y ~ bernoulli_logit(alpha + beta * x). 

I then want you to fit both data sets to the Stan model (we only need one Stan model here) and print a summary of the results. Then I want you to generate new data sets with N=1024 rather than N = 256 and fit those. I'm looking to see if the posterior interval around beta shrinks as the data increases. 

Please give me the Python code to do this.

And here’s the Python that GPT 5.2 plus generated today.

"""
Simulate two x-generating mechanisms from a Gamma(shape=16, rate=0.2) draw v:

(1) x = GammaCDF(v)  -> x ~ Uniform(0, 1)
(2) x = GammaPDF(v)  -> x is a density value (positive, typically small)

Then simulate y ~ BernoulliLogit(alpha + beta * x) with alpha=1, beta=3,
fit a logistic regression in Stan with improper uniform priors (i.e., no priors),
and compare posterior intervals for beta at N=256 and N=1024.
"""

from __future__ import annotations

import os
import tempfile
from dataclasses import dataclass
from typing import Dict, Tuple

import numpy as np
from cmdstanpy import CmdStanModel
from scipy.stats import gamma as gamma_dist


def inv_logit(z: np.ndarray) -> np.ndarray:
    """Numerically stable logistic function."""
    z = np.asarray(z, dtype=np.float64)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out


def simulate_xy(
    n: int,
    x_mode: str,
    alpha: float,
    beta: float,
    rng: np.random.Generator,
    shape: float = 16.0,
    rate: float = 0.2,
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Simulate (x, y) where v ~ Gamma(shape, rate), x is either CDF(v) or PDF(v),
    and y ~ BernoulliLogit(alpha + beta * x).
    """
    if x_mode not in {"cdf", "pdf"}:
        raise ValueError("x_mode must be 'cdf' or 'pdf'.")

    scale = 1.0 / rate
    v = gamma_dist(a=shape, scale=scale).rvs(size=n, random_state=rng)

    if x_mode == "cdf":
        x = gamma_dist(a=shape, scale=scale).cdf(v)
    else:
        x = gamma_dist(a=shape, scale=scale).pdf(v)

    p = inv_logit(alpha + beta * x)
    y = rng.binomial(n=1, p=p, size=n).astype(np.int32)
    return x.astype(np.float64), y


def write_stan_model(stan_path: str) -> None:
    """Write a logistic regression Stan program with improper uniform priors."""
    stan_code = r"""
data {
  int<lower=1> N;
  vector[N] x;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  real alpha;
  real beta;
}
model {
  // Improper uniform priors: no prior terms added.
  y ~ bernoulli_logit(alpha + beta * x);
}
"""
    with open(stan_path, "w", encoding="utf-8") as f:
        f.write(stan_code)


@dataclass(frozen=True)
class FitSummary:
    alpha_q025: float
    alpha_q975: float
    beta_q025: float
    beta_q975: float
    beta_width: float


def fit_and_summarize(
    model: CmdStanModel,
    x: np.ndarray,
    y: np.ndarray,
    seed: int,
    chains: int = 4,
    iter_warmup: int = 1000,
    iter_sampling: int = 1000,
) -> Tuple[object, FitSummary]:
    """Fit the Stan model and return CmdStanMCMC plus beta interval width."""
    data = {"N": int(len(x)), "x": x, "y": y.tolist()}

    fit = model.sample(
        data=data,
        seed=seed,
        chains=chains,
        parallel_chains=min(chains, os.cpu_count() or 1),
        iter_warmup=iter_warmup,
        iter_sampling=iter_sampling,
        show_progress=False,
    )

    summ = fit.summary()
    alpha_q025 = float(summ.loc["alpha", "5%"])
    alpha_q975 = float(summ.loc["alpha", "95%"])
    beta_q025 = float(summ.loc["beta", "5%"])
    beta_q975 = float(summ.loc["beta", "95%"])
    beta_width = beta_q975 - beta_q025

    return fit, FitSummary(alpha_q025, alpha_q975, beta_q025, beta_q975, beta_width)


def run_experiment(
    model: CmdStanModel,
    n: int,
    rng: np.random.Generator,
    alpha_true: float = 1.0,
    beta_true: float = 3.0,
) -> Dict[str, FitSummary]:
    """Simulate and fit both x-modes at a given n."""
    results: Dict[str, FitSummary] = {}
    for x_mode in ["cdf", "pdf"]:
        x, y = simulate_xy(n=n, x_mode=x_mode, alpha=alpha_true, beta=beta_true, rng=rng)
        _, fs = fit_and_summarize(
            model=model,
            x=x,
            y=y,
            seed=int(rng.integers(1, 2**31 - 1)),
        )
        results[x_mode] = fs
    return results


def main() -> None:
    rng = np.random.default_rng(20260116)

    with tempfile.TemporaryDirectory() as td:
        stan_path = os.path.join(td, "logit_improper.stan")
        write_stan_model(stan_path)
        model = CmdStanModel(stan_file=stan_path)

        for n in [256, 1024]:
            res = run_experiment(model=model, n=n, rng=rng)

            print(f"\n=== N = {n} ===")
            for mode in ["cdf", "pdf"]:
                fs = res[mode]
                print(f"\n-- x mode: {mode} --")
                print(f"alpha  90% interval: [{fs.alpha_q025:.3f}, {fs.alpha_q975:.3f}]")
                print(f"beta   90% interval: [{fs.beta_q025:.3f}, {fs.beta_q975:.3f}]")
                print(f"beta interval width:  {fs.beta_width:.3f}")

        print("\nNote: cmdstanpy also provides full summaries via fit.summary().")


if __name__ == "__main__":
    main()

Thank you very much @Bob_Carpenter! So the situation is that I have a logistic regression:

y \sim \text{BernoulliLogit}(\alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3)

The covariates x_1 and x_2 take values between zero and one, with one corresponding to P(y=1). However, x_3 comes from a set of values, say \{d_i\}, that follow a gamma distribution.

I was looking for a transformation f() such that x_{3,i} = f(d_i) that would be appropriate for the logistic regression above. Does this make sense?

Thanks, I will recheck this…

If you have a prior for \Pr[Y = 1], then I’m not sure this is the best way to use it. In a regression setting, I’d recommend applying a log odds transform, \textrm{logit}(x_3) = \log \frac{x_3}{1 - x_3}, which will make everything symmetric around 0. That is assuming there are no boundary values. If there is a boundary value, then I’d think you’d want to make sure y was equal to that value.
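A minimal sketch of that log odds transform (the `logit` helper is hypothetical, written with `log1p` for stability near 1):

```python
import numpy as np

def logit(p):
    """Log odds: map (0, 1) to the real line, symmetric around p = 0.5."""
    p = np.asarray(p, dtype=float)
    return np.log(p) - np.log1p(-p)

print(logit([0.1, 0.5, 0.9]))  # approximately [-2.197, 0.0, 2.197]
```

Note how complementary probabilities map to values symmetric around zero, which is what makes the transformed covariate behave well in the regression.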

Whenever you have an intercept and all the covariates of the same sign, you induce a lot of correlation in the posterior between the intercept and the regression coefficients. One way to deal with this is to standardize the covariates or otherwise transform them. If you have covariates in (0, 1), you can apply a log odds (logit) transform to map onto (-\infty, \infty). If you have covariates in (0, \infty), you can use a log transform (or soft plus). Then once you’ve transformed to an unconstrained scale, you can standardize by subtracting the mean and dividing by the standard deviation. In fact, you can try that with and without mapping to unconstrained first.
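A minimal sketch of that recipe (the `standardize` helper and the gamma parameters are assumptions for illustration, echoing the example earlier in the thread):

```python
import numpy as np

def standardize(x):
    """Center to mean 0 and scale to standard deviation 1."""
    return (x - x.mean()) / x.std()

rng = np.random.default_rng(0)
d = rng.gamma(shape=16.0, scale=5.0, size=1_000)  # covariate on (0, inf)
p = rng.uniform(0.01, 0.99, size=1_000)           # covariate in (0, 1)

x3 = standardize(np.log(d))                 # log maps (0, inf) to R, then standardize
x1 = standardize(np.log(p) - np.log1p(-p))  # logit maps (0, 1) to R, then standardize

print(x3.mean(), x3.std())  # ~0 and ~1
```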

How the covariates were generated doesn’t affect a regression, because the posterior is conditionally independent of the distribution of the covariates. That’s why you don’t usually worry about modeling the covariates unless there’s missing data.