Bias from improper priors in regression

I’m having issues with the following model, where points are distributed at various distances across a 3d space, up to some hard cutoff in the observed x-values. The parameter of interest is the slope between two sets of distance observations. However, the noise in the two sets of observations has different properties: one is lognormally distributed and one is normally distributed. Since the original points are volume distributed, the prior probability goes as distance squared.

When I simulate data using the process specified above and run the model on the simulated data, the recovered slope is badly biased. The bias seems to depend on the relative sizes of the uncertainties in the two covariance matrices.

My first thought was that the prior was contributing extra factors of the slope to the posterior probability, but when I integrate the posterior analytically over all possible observed x, observed y, and true x values, the marginal posterior distribution of the slope comes out the same as the prior.

Any help in figuring out why I’m not able to recover the simulated parameters and how to fix it would be appreciated!

data{
    int numdata;
    vector[numdata] xobs;
    vector[numdata] logyobs;
    matrix[numdata, numdata] ycov;
    matrix[numdata, numdata] xcov;
}

parameters{
    vector<lower=0>[numdata] xtrue;
    real logslope;
}
transformed parameters{
    real slope = exp(logslope);
}

model{
    logslope ~ normal(4, 1);
    logyobs ~ multi_normal(log(xtrue / slope), ycov);
    xobs ~ multi_normal(xtrue, xcov);
    target += log(xtrue) * 2;  // volume prior: p(xtrue) proportional to xtrue^2
}

Are you sure you didn’t mean the exponent in target += log(xtrue) * 2 to be negative? That is, did you intend a Pareto distribution?

The prior you have above pushes xtrue toward large values rather than penalizing large values. In general, we’d recommend writing priors using distributions where possible to avoid these kinds of issues.
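Spelling out the sign convention, since target += increments the log density, the statement above corresponds to

p\left(x_{\text{true}}\right) \propto \exp\left(2 \log x_{\text{true}}\right) = x_{\text{true}}^2,

which grows without bound, whereas a negative coefficient would give p\left(x_{\text{true}}\right) \propto x_{\text{true}}^{-2}, a Pareto-style tail that penalizes large values instead.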

P.S. I’d also recommend avoiding manual transforms and adding relevant constraints. That’d be:

cov_matrix[numdata] xcov, ycov;

real<lower=0> slope;

slope ~ lognormal(4, 1);

That puts the prior median at exp(4) ≈ 55 and the central 95% prior interval on slope at roughly (8, 390).
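Those numbers are easy to double-check with SciPy (a minimal sketch; note that scipy.stats.lognorm takes s = sigma and scale = exp(mu)):

import numpy as np
from scipy import stats

prior = stats.lognorm(s=1, scale=np.exp(4))  # lognormal(4, 1)
print(prior.median())             # exp(4) = 54.598...
print(prior.ppf([0.025, 0.975]))  # central 95% interval, approx. [7.7, 388]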

This can also be problematic:

xobs ~ multi_normal(xtrue, xcov);

With xtrue being the same size as xobs, this is just going to push the xtrue values to match the xobs values. Was that the intention?

The observed data points are uniformly distributed throughout a 3d volume (in the real data, these are galaxies distributed throughout the local universe). That means that, in spherical coordinates with number density per unit volume n(r,\theta,\phi), the prior probability for their true distances is

P(r)\, dr \propto n(r,\theta,\phi)\, d^3V = n\, r^2\, d\Omega\, dr \propto r^2\, dr

That is why the prior pushes xtrue to large values: there are more galaxies far away from us than there are nearby.
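For what it’s worth, sampling from that r^2 density with a hard cutoff is a one-liner via the inverse CDF, since the CDF is (r/r_\text{max})^3; this is the same u^{1/3} trick that appears in my simulation code below:

import numpy as np

rmax = 50
u = np.random.random(1000)
r = rmax * u**(1/3)  # density proportional to r^2 on (0, rmax)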

This prior is improper; the only reason my sample is finite is that I select the data by requiring that, to be part of the sample, the points must have xobs less than some threshold.

This is part of the likelihood; xobs and logyobs are two sets of distance measurements from different techniques, and the true distances of these objects should (ideally of course) be related to both of the measured distances.

An example showing what this selection looks like for the simulated data:

[figure: distribution of the simulated distances around the xobs cutoff]

The density of observed xobs values increases up to the cutoff value, while some observations that are in truth further away scatter down into the sample.

Got it. Thanks for verifying.

OK, I see. So this is like a noisy measurement model for xobs. I’m afraid I don’t have any more ideas on how to fit this better. The prior wants to pull xtrue as large as possible, and I have no experience working with improper priors that have that kind of pull.

P.S. I’d replace log(xtrue/slope) with log(xtrue) - logslope. And for efficiency, you really want to do this:

transformed data {
  matrix[numdata, numdata] L_xcov = cholesky_decompose(xcov);
...
model {
  xobs ~ multi_normal_cholesky(xtrue, L_xcov);

and same for y. That saves having to factor xcov and ycov every log density eval. But none of that is going to help with the statistical issue.
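The same factor-once idea, sketched in Python with SciPy just to illustrate the savings (hypothetical sizes; not the Stan internals):

import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
cov = A @ A.T + 100 * np.eye(100)  # a positive-definite covariance

factor = linalg.cho_factor(cov)  # O(n^3) factorization, done once
for b in rng.standard_normal((50, 100)):
    x = linalg.cho_solve(factor, b)  # O(n^2) per solve, reusing the factor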

I strongly suspect that the bias you are seeing is due to the truncation of the xobs, and the resulting soft truncation of xtrue, neither of which is accounted for in the likelihood.

Since you are simulating data, it seems as if you have a generative model for this system. Can you write a Stan program that encodes this generative model? The program you are working with now does not: it doesn’t account for the truncation, and attempting to simulate from it directly immediately runs into problems, because there is no way to simulate the distances xtrue; their distribution is improper and pushes probability mass towards arbitrarily large values.

I disagree. The two truncation effects you mention work in opposite directions and in fact exactly cancel each other out (the product is written out after the list). In other words:

  1. Truncation changes the prior for x_{\text{true}} by the multiplier P\left(x_{\text{obs}} < x_{\text{max}}\mid x_{\text{true}}\right), because x_{\text{obs}} > x_{\text{max}} is no longer included in the model.
  2. Conditional on truncation, the likelihood of x_{\text{obs}} needs the correction \frac{1}{P\left(x_{\text{obs}} < x_{\text{max}}\mid x_{\text{true}}\right)} to ensure the probability still integrates to one.
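Multiplying the truncated prior (1) by the corrected likelihood (2), the selection probabilities cancel, leaving exactly the joint density that the untruncated model computes:

p\left(x_{\text{true}}\right) P\left(x_{\text{obs}} < x_{\text{max}}\mid x_{\text{true}}\right) \times \frac{p\left(x_{\text{obs}}\mid x_{\text{true}}\right)}{P\left(x_{\text{obs}} < x_{\text{max}}\mid x_{\text{true}}\right)} = p\left(x_{\text{true}}\right) p\left(x_{\text{obs}}\mid x_{\text{true}}\right)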

A quick Simulation-Based Calibration run does not show much bias.

import cmdstanpy
import numpy as np, numpy.random as npr
import matplotlib.pyplot as plt

model = cmdstanpy.model.CmdStanModel(stan_file="/home/nhuurre/space.stan")

# simulate data with diagonal covariance
def simulate(numdata, sigma_x, sigma_y, max_xobs=25):
    xcov = sigma_x*np.identity(numdata)
    max_xtrue = max_xobs + 10*sigma_x
    xobs = np.zeros(numdata)
    i = 0
    while i < numdata:
        # radius of a uniform draw from the cube [0, max_xtrue]^3; for
        # r < max_xtrue this has density proportional to r^2
        x = (npr.uniform(0, max_xtrue, size=3)**2).sum()**0.5
        if x > max_xobs:
            continue
        xobs[i] = x
        i += 1
    ycov = sigma_y*np.identity(numdata)
    logslope = npr.normal(4,1)
    logyobs = npr.multivariate_normal(np.log(xobs) - logslope, ycov)
    return dict(numdata=numdata, xcov=xcov, ycov=ycov, xobs=xobs, logyobs=logyobs), logslope

rank = np.zeros(41)  # 40 retained posterior draws per fit, so ranks run 0..40
NSIM = 500
for _ in range(NSIM):
    data, logslope = simulate(100,1,1)
    draws = model.sample(data=data, thin=100, show_progress=False).draws_xr()
    # number of posterior draws that are smaller than the true value
    n = np.sum(draws['logslope'] < logslope)
    # rank histogram
    rank[n] += 1

plt.figure(figsize=(6,8))
axes = plt.subplot(2,1,1)

# histogram
axes.bar(np.arange(41), rank)

axes = plt.subplot(2,1,2)
x = np.linspace(1/42,1,41)
p = np.array([1/41] * 41)
for _ in range(100):
    # ranks are multinomial if the posterior is unbiased
    axes.plot(x, np.cumsum(npr.multinomial(NSIM,p))/NSIM - x, c='k', alpha=0.05)
for _ in range(5):
    # highlight a couple for easier comparison
    axes.plot(x, np.cumsum(npr.multinomial(NSIM,p))/NSIM - x, c='k', alpha=0.3)

# ecdf curve
axes.plot(x,np.cumsum(rank)/NSIM - x, c='k')


Nice one!


Thanks for taking a look! I’d gone through that same logic when I was coming up with the model.

The bias that I see scales strongly with the covariance in x_\mathrm{obs}, particularly when many elements are correlated. It seems like the likelihood then fails to provide a useful constraint on the overall scale of the problem, and Stan becomes free to just make all the elements larger than they actually are.

This is the code I was using to generate simulated data:

import numpy as np
from scipy import stats, linalg

# random positive-definite covariance matrices for 35 data points
xcov = np.random.random((35, 35)) - 0.5
xcov = (np.dot(xcov, xcov.T) * 10 + np.diag(np.ones(35))) / 10
ycov = np.random.random((35, 35)) - 0.5
ycov = (np.dot(ycov, ycov.T) + np.diag(np.ones(35))) / 100

rmax = 50


def generatesimdata(xmax, xcov, ycov):
    dists = np.empty(shape=(0))
    deltax = np.empty(shape=(0))
    errors = 0
    for i in range(ycov.shape[0]):
        dists = np.append(dists, np.array([np.nan]))
        deltax = np.append(deltax, np.array([0]))
        xobs = 1e100
        while xobs > xmax:  # redraw until the observed x falls below the cutoff
            try:
                # true distance with density proportional to r^2 out to rmax
                dists[-1] = rmax*(np.random.random(1))**(1/3)

                # solve for the conditional variance and mean of the newest x error,
                # given the values already drawn for the rest
                var = 1 / (linalg.cho_solve(linalg.cho_factor(xcov[:i+1, :i+1]),
                                            (deltax.size-1) == np.arange(deltax.size)))[deltax.size-1]
                mean = deltax[-1] - (var * linalg.cho_solve(linalg.cho_factor(xcov[:i+1, :i+1]), deltax))[-1]

                deltax[-1] = np.random.normal(mean, np.sqrt(var))
                xobs = dists[-1] + deltax[-1]
            except np.linalg.LinAlgError:
                errors += 1
                continue

    xobs = dists + deltax
    trueslope = np.exp(np.random.normal(4, 1))
    logyobs = stats.multivariate_normal(np.log(dists/trueslope), ycov).rvs()

    return {'numdata': xobs.size, 'logyobs': logyobs, 'xobs': xobs, 'ycov': ycov,
            'xcov': xcov, 'true_xtrue': dists, 'true_slope': trueslope}

I don’t think so; if the likelihood failed to constrain the scale, Stan would just sample from the prior, and the rank histogram generated by Simulation-Based Calibration would still appear unbiased.
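To illustrate that point, here’s a toy rank check (no Stan, purely hypothetical numbers) where the "posterior" ignores the data entirely and just returns fresh prior draws; the resulting rank histogram is still flat:

import numpy as np

rng = np.random.default_rng(1)
NSIM, NDRAWS = 2000, 40
rank = np.zeros(NDRAWS + 1)
for _ in range(NSIM):
    theta = rng.normal(4, 1)              # "true" value drawn from the prior
    post = rng.normal(4, 1, size=NDRAWS)  # "posterior" = prior, data ignored
    rank[np.sum(post < theta)] += 1
# rank is uniform up to multinomial noise, so SBC alone cannot detect
# a likelihood that fails to constrain the parameter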

The problem is most likely an incorrect implementation of truncation in the data generating method. You can’t truncate a multivariate distribution by truncating the conditionals one by one. Here’s an alternative method that I believe gives the correct result (but numdata is now a random variable, so xcov must be significantly larger than the observed dataset).

def simulation(xmax, xcov, ycov):
    N = xcov.shape[0]
    # extend the true-distance range past the cutoff so points beyond it
    # can still scatter down into the sample
    max_xtrue = xmax + np.einsum('ii', xcov)/N
    # true distances with density proportional to r^2 (radius of a uniform cube draw)
    x = (npr.uniform(0, max_xtrue, size=(3,N))**2).sum(axis=0)**0.5
    xobs = npr.multivariate_normal(x, xcov)
    # truncate jointly: keep only the points whose observed x falls below the cutoff
    sel = xobs < xmax
    x = x[sel]
    xobs = xobs[sel]
    xcov = xcov[sel,:][:,sel]
    ycov = ycov[sel,:][:,sel]
    logslope = npr.normal(4,1)
    logyobs = npr.multivariate_normal(np.log(x) - logslope, ycov)
    return dict(numdata=xobs.size, xobs=xobs, logyobs=logyobs, xcov=xcov, ycov=ycov, logslope=logslope)
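As a quick numeric check of the one-by-one claim (a toy 2-d sketch, not your generating code): compare rejection sampling from the jointly truncated Gaussian against truncating the conditionals sequentially; even the x1 marginals disagree.

import numpy as np
import numpy.random as npr

rho, c, N = 0.8, 0.0, 200_000

# method 1: joint rejection, the correctly truncated distribution
cov = np.array([[1.0, rho], [rho, 1.0]])
z = npr.multivariate_normal([0.0, 0.0], cov, size=4 * N)
joint = z[(z[:, 0] < c) & (z[:, 1] < c)]

# method 2: truncate the conditionals one at a time
x1 = npr.normal(size=4 * N)
x1 = x1[x1 < c][:N]  # truncated marginal for x1
x2 = npr.normal(rho * x1, np.sqrt(1 - rho**2))
bad = x2 >= c
while bad.any():     # resample until the conditional draw clears the cutoff
    x2[bad] = npr.normal(rho * x1[bad], np.sqrt(1 - rho**2))
    bad = x2 >= c

print(joint[:, 0].mean())  # about -0.90
print(x1.mean())           # about -0.80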
