How to force PyStan to recompile a stan model?

I have a weighted Bayesian Logistic Regression model

weighted_stan_representation = """
data {
  int<lower=0> n; // number of observations
  int<lower=0> d; // number of predictors
  array[n] int<lower=0,upper=1> y; // outputs
  matrix[n,d] x; // inputs
  vector<lower=0>[n] w; // coreset weights
}
parameters {
  vector[d] theta; // auxiliary parameter
}
model {
  theta ~ normal(0, 1);
  target += w*bernoulli_logit_lpmf(y| x*theta);
  
  
}

with data as such:

{'x': array([[-1.92220908, -0.86248914],
        [-0.64517094,  0.40222473],
        [-0.71675321, -1.2782317 ],
        ...,
        [-2.0448459 , -0.11735602],
        [-0.9622542 , -2.27172399],
        [-1.09545494, -0.83435958]]),
 'y': array([0, 0, 0, ..., 0, 0, 0]),
 'w': array([1., 1., 1., ..., 1., 1., 1.]),
 'd': 2,
 'n': 10000}

I can get samples from the full posterior, i.e. with weights uniformly 1 by running

posterior = stan.build(model.weighted_stan_representation, data = full_data, random_seed = 100000)
fit = posterior.sample(num_chains = num_chains, num_samples = num_samples, num_warmup = num_warmup)

And I then want to use a sparse weight vector, and sample from the approximate sparse posterior using

coreset_posterior = stan.build(model.weighted_stan_representation, data = coreset_data)
coreset_samples = coreset_posterior.sample(num_chains = num_chains, num_samples = num_samples, num_warmup = num_warmup)

However when I access the samples, they are exactly equivalent between the two cases. I’m confident it has something to do with the model being cached when stan.build is first called, and so no new samples are ever actually being taken. This is because I get this output

Building: found in cache, done.

when I run the second stan representation. This is the first time I’ve used PyStan and I don’t know how to get around this. There doesn’t seem to be an option to force PyStan to recompile as far as I can tell.

Any help would be appreciated!

I’ve got the latest version of Python and PyStan installed.

The model is cached → then model + data is combined and posterior object is returned for the user.

So something else is going on here.

When you say they are equivalent, what does that mean? How different are the datasets?

1 Like

The samples given back by the two calls are exactly identical. The second dataset is a small weighted subset of the first, e.g.

{'x': array([[ 0.76792184,  0.77400448],
        [ 0.63198014,  0.78407462],
        [ 0.34475047, -3.33147339],
        [-2.96240585,  0.72891401],
        [-0.79129827,  2.89731581]]),
 'y': array([1, 1, 0, 0, 1]),
 'w': array([23.58330934, 19.34682126, 13.67691998,  8.92135991,  4.41853865]),
 'd': 2,
 'n': 5}

Here are the full samples

array([[2.96173745, 2.96334996, 2.96171992, ..., 2.9617269 , 2.96216384,
        2.96209904],
       [2.94513431, 2.94605177, 2.94542099, ..., 2.94468218, 2.9446128 ,
        2.9453812 ]])

and here are the coreset_samples

array([[2.96173745, 2.96334996, 2.96171992, ..., 2.9617269 , 2.96216384,
        2.96209904],
       [2.94513431, 2.94605177, 2.94542099, ..., 2.94468218, 2.9446128 ,
        2.9453812 ]])

It is extra odd because it appears like time is spent when we sample the coreset_samples, event though the end result is identical.

Thanks for the quick response!

That sounds weird.

Can you show your code (sampling + printing)?

Sure!

This is where I find the full samples

# Perform MCMC on the full data set

# Set MCMC parameters
num_chains = 4
num_samples = 2500
num_warmup = 1000

# Convert data to stan format
full_data = stanvert(X, Y, np.ones(X.shape[0]))

# Build the stan model and sample
posterior = stan.build(model.weighted_stan_representation, data = full_data, random_seed = 100000)
fit = posterior.sample(num_chains = num_chains, num_samples = num_samples, num_warmup = num_warmup)

# Extract the samples
full_samples = fit["theta"]

This is the for loop where I’m computing the coreset_samples

# Define vector of corset sizes we will compute metrics for
M = np.unique(np.linspace(1, 5, 5, dtype=np.int32))

# Build the coreset and compute metrics at each step, initialise timer
timer = 0 
for m in range(M.shape[0]):
    print('M = ' + str(M[m]) + ': Constructing coreset')
    # Set next number of corest iterations to run
    itrs = (M[m] if m == 0 else M[m] - M[m - 1])

    # Build the coreset and time it
    t0 = time.process_time()
    sparseVI_fast.build(itrs)
    timer += time.process_time() - t0
    
    print('M = ' + str(M[m]) + ': Converting data to stan format')
    # Get the coreset weights, points and indices
    w, Z_c, idcs = sparseVI_fast.get()
    
    # Calculate original covariate values for weighted data and get responses
    X_c = Z_c * Y[idcs][:, np.newaxis]
    Y_c = Y[idcs]
    
    # Convert data to stan format
    coreset_data = stanvert(X_c, Y_c, w)
    
    print('M = ' + str(M[m]) + ': MCMC')
    # Build the stan model and sample
    coreset_posterior = stan.build(model.weighted_stan_representation, data = coreset_data, random_seed = m)
    coreset_samples = coreset_posterior.sample(num_chains = num_chains, num_samples = num_samples, num_warmup = num_warmup)

    # Extract the samples
    coreset_samples = fit["theta"]

print('Done!')

When I’m building the coreset in the above for loop, all it is doing is finding a weighted subset of the data such as

{'x': array([[ 0.76792184,  0.77400448],
        [ 0.63198014,  0.78407462],
        [ 0.34475047, -3.33147339],
        [-2.96240585,  0.72891401],
        [-0.79129827,  2.89731581]]),
 'y': array([1, 1, 0, 0, 1]),
 'w': array([23.58330934, 19.34682126, 13.67691998,  8.92135991,  4.41853865]),
 'd': 2,
 'n': 5}

Then I print the samples just by calling

coreset_samples

and

full_samples

Sure!

I compute the full samples here

# Perform MCMC on the full data set

# Set MCMC parameters
num_chains = 4
num_samples = 2500
num_warmup = 1000

# Convert data to stan format
full_data = stanvert(X, Y, np.ones(X.shape[0]))

# Build the stan model and sample
posterior = stan.build(model.weighted_stan_representation, data = full_data, random_seed = 100000)
fit = posterior.sample(num_chains = num_chains, num_samples = num_samples, num_warmup = num_warmup)

# Extract the samples
full_samples = fit["theta"]

and I want to sample from several approximate posteriors in a for loop

# Define vector of corset sizes we will compute metrics for
M = np.unique(np.linspace(1, 5, 5, dtype=np.int32))

# Build the coreset and compute metrics at each step, initialise timer
timer = 0 
for m in range(M.shape[0]):
    print('M = ' + str(M[m]) + ': Constructing coreset')
    # Set next number of corest iterations to run
    itrs = (M[m] if m == 0 else M[m] - M[m - 1])

    # Build the coreset and time it
    t0 = time.process_time()
    sparseVI_fast.build(itrs)
    timer += time.process_time() - t0
    
    print('M = ' + str(M[m]) + ': Converting data to stan format')
    # Get the coreset weights, points and indices
    w, Z_c, idcs = sparseVI_fast.get()
    
    # Calculate original covariate values for weighted data and get responses
    X_c = Z_c * Y[idcs][:, np.newaxis]
    Y_c = Y[idcs]
    
    # Convert data to stan format
    coreset_data = stanvert(X_c, Y_c, w)
    
    print('M = ' + str(M[m]) + ': MCMC')
    # Build the stan model and sample
    coreset_posterior = stan.build(model.weighted_stan_representation, data = coreset_data, random_seed = m)
    coreset_samples = coreset_posterior.sample(num_chains = num_chains, num_samples = num_samples, num_warmup = num_warmup)

    # Extract the samples
    coreset_samples = fit["theta"]
  
    
print('Done!')

When I build the coreset in the for loop, all that boils down to is finding a weighted subset of the data such as

{'x': array([[ 0.76792184,  0.77400448],
        [ 0.63198014,  0.78407462],
        [ 0.34475047, -3.33147339],
        [-2.96240585,  0.72891401],
        [-0.79129827,  2.89731581]]),
 'y': array([1, 1, 0, 0, 1]),
 'w': array([23.58330934, 19.34682126, 13.67691998,  8.92135991,  4.41853865]),
 'd': 2,
 'n': 5}

I then just print the samples using

coreset_samples

and

full_samples

I found a bug in my code and now it seems to be working fine. I was extracting the wrong samples in the line

coreset_samples = fit["theta"]

fixing that seems to have sorted the problem.

Apologies for wasting your time, couldn’t see the trees for the forest!

1 Like

Great that this got fixed.

1 Like

Just to also jump in here, I’m not sure if your model is doing what you intend here:

  target += w*bernoulli_logit_lpmf(y| x*theta);

When you call the lpmf functions with vector inputs they return the sum across all inputs. So when you specify:

bernoulli_logit_lpmf(y| x*theta)

This returns the log-probability of the entire set of observations y. If you then multiply it by the vector w, this is essentially saying that you’ve observed the entire set of observations y w[1] times and w[2] times, etc. In other words, your current statement is equivalent to:

for(i in 1:N) {
  target() += w[i]*bernoulli_logit_lpmf(y| x*theta);
}

If you’re intending to apply each weight to each element of y, you’ll need to instead specify your likelihood as a loop over both y and w:

model {
  vector[N] p = x * theta;
  for(i in 1:N) {
    target() += w[i] * bernoulli_logit_lpmf(y[i] | p[i]);
  }
}

Apologies if the model was already specified as intended though!

1 Like

Ah thank you! That was definitely unintended. Am I specifying the prior Theta ~ N(0, 1) incorrectly here too? Theta is a two dimensional vector of parameters.

model {
  theta ~ normal(0, 1);
  vector[n] p = x * theta;
  for(i in 1:n) {
    target += w[i] * bernoulli_logit_lpmf(y[i] | p[i]);
  } 
}

Yep theta is fine, as long as when you say ‘two dimensional vector of parameters’ you mean a single D \text{ x } 1 vector?

Yep! Lovely, thank you for your help!

1 Like