A hard modeling problem using Bayesian inference

@Bob_Carpenter

Problem Statement: Infer the posterior distribution of yT given x, using the usual approximate Bayesian relation p(yT|x) ∝ p(x|yT) p(yT|yH). Now, let’s define each of these variables:

  • x is a data vector of size ncol x 1, where the density of x|yT is 1/(sqrt(2*pi)*sigma_x) * exp(-(x - A*yT)^2 / (2*sigma_x^2)) (this really just means x|yT ~ N(A*yT, sigma_x)); sigma_x is a vector of size ncol x 1 with all entries equal to a positive constant; A is a matrix of size nrow x ncol defined as:

A = matrix(c(1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1),
nrow, ncol, byrow= TRUE);

  • yT is a parameter vector of size ncol x 1, where the density of yT is 1/(sqrt(2*pi)*sigma_y) * exp(-(yT - yH)^2 / (2*sigma_y^2)); sigma_y is a vector of size ncol x 1 with all entries equal to a positive constant, with sigma_y ≠ sigma_x

  • yH is a data vector of size ncol x 1, constructed as follows:

lambda = matrix(sample.int(100, size = ncol*nrow, replace = T),nrow,ncol);
lambda = lambda - diag(lambda)*diag(x=1, nrow, ncol);
yH = rpois(ncol,lambda) + rtruncnorm(ncol,0,1,mean = 0, sd = 1);

  • Key constraint: x = A*(yH + epsilon)

I hope this is sufficient for you to code. Please let me know if anything does not make sense to you.

Hi, could you be more specific about the question you have for this model?

Alright, I think I made a model that meets all the constraints. Because you have x defined in terms of yH, I defined that in a transformed data block.

I’m going to assume sigma_x and sigma_y are actually just scalars (since all entries of each are equal to a single positive constant?) and that those density functions are products over ncol independent normals. With that:

data {
  int ncol;
  int nrow;
  vector[ncol] yH;
  matrix[ncol, nrow] A;
  real epsilon; // Not sure what type epsilon is
}

transformed data {
  vector[ncol] x;
  x = A * (yH + epsilon);
}

parameters {
  vector[ncol] yT;
  real<lower=0.0> sigma_x;
  real<lower=0.0> sigma_y;
}

model {
  yT ~ normal(yH, sigma_y);
  x ~ normal(A * yT, sigma_x);
}

How’s that?


But yeah, answer @sakrejda’s question. That model might be what you asked for, but it might not be what you want, so buyer beware, etc.

We get so overwhelmed with questions sometimes that I really try to push for specific questions (so we know when we’ve answered them), but it’s nice that some of us can be more generous at times :)

K


Appreciate your question. Let me elaborate on the question above as follows:

What I want to know is the distribution of yT given the data column vector x, where x is obtained from the given column vector yH (which reflects historical data) through the equation:
x = A*(yH + epsilon) (A is a matrix of size nrow x ncol, where in our case, let’s say nrow = 16, ncol = 24).

@bbbales2: thank you very much for your help. epsilon is a column vector with size ncol x 1. I think your current code is exactly what I would need, except for the definition of epsilon. Could you also help with the execution code in R (your code is in Stan, which on its own would not produce any output) to produce the approximate distribution of yT given the data column vector x? The method I want to use for this approximation is MCMC.

library(rstan)
library(MASS)
library(truncnorm)
set.seed(1)
nrow=16;
ncol = 24;
sigma=0.2;
sigma2=0.3;
lambda = matrix(sample.int(100, size = ncol*nrow, replace = T),nrow,ncol);
lambda = lambda - diag(lambda)*diag(x=1, nrow, ncol);
yh = rpois(ncol,lambda) + rtruncnorm(ncol,0,1,mean = 0, sd = 1);
A = matrix(c(1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, #x_A^1 - Done
0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0, #x_A^2 - Done
0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, #x_B^1 - Done
1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0, #x_B^2 - Done
0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, #x_C^1 - Done
0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0, #x_C^2 - Done
0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0, #x_D^1 - Done
0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, #x_D^2 - Done
0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0, #x_E^1 - Done
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0, #x_E^2 - Done
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0, #x_F^1 - Done
0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0, #x_F^2 - Done
0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0, #x_G^2 - Done
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1, #x_G^1 - Done
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0, #x_H^1 - Done
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1), #x_H^2 - Done
nrow, ncol, byrow= TRUE);

It looks like you’re using R, so this should be about all you need to hook the data and model together and give it a run: https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started

You should just be able to change the definition of epsilon to be what you want.

Hope it works for you!


Thank you very much for your help. So, let’s assume I put your model code into the same R execution file and add the following:

modelString = "data {
int ncol;
int nrow;
vector[ncol] yH;
matrix[ncol, nrow] A;
real epsilon; // Not sure what type epsilon is
}

transformed data {
vector[ncol] x;
x = A * (yH + epsilon);
}

parameters {
vector[ncol] yT;
real<lower=0.0> sigma_x;
real<lower=0.0> sigma_y;
}

model {
yT ~ normal(yH, sigma_y);
x ~ normal(A * yT, sigma_x);
}".

Could you please help inspect whether the following execution code means I am using an MCMC sampling technique to infer the approximate distribution of yT|x:

truehist(x, col="#B2001D")
lines(density(x), col="skyblue", lwd=2)
summary(yT)
ret_sm <- stan_model(model_code = modelString) # Compile Stan code
fit <- sampling(ret_sm, warmup=100, iter=600, seed=1,
data=list(yh, N=length(yh)))
stan_trace(fit, inc_warmup = TRUE)

For all the input data in your Stan model, you need to provide stuff in the data argument of sampling:

data {
  int ncol;
  int nrow;
  vector[ncol] yH;
  matrix[ncol, nrow] A;
  real epsilon; // Not sure what type epsilon is
}

So you’ll need to pass in more things in data (to sampling) than you currently have.

Also, you need to have entries in a named list you pass to Stan. So instead of:
data=list(yh, N=length(yh))

You need to give the variables names (that match the stuff inside the Stan model):
data=list(yH = yh, N=length(yh), ...)

For instance (where … is just a placeholder for the other stuff you need to add)
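
For what it’s worth, here is a rough sketch of what the full call could look like for this particular model (just a guess: epsilon = 0 is only a stand-in until you decide what epsilon should actually be):

ret_sm <- stan_model(model_code = modelString)  # compile the Stan program

# Every variable declared in the data block needs a matching named entry here:
# ncol, nrow, yH, A, and epsilon (this model has no N).
fit <- sampling(ret_sm,
                data = list(ncol = ncol, nrow = nrow, yH = yh,
                            A = A,         # A must match the dimensions declared in the model
                            epsilon = 0),  # placeholder value
                warmup = 100, iter = 600, seed = 1)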


Thank you so much for your patience. I made the following changes to the data (in the sampling function), but it produces some really weird errors that I don’t understand. I copied my code below for your reference:

# Stan model
modelString= "data {
int ncol;
int nrow;
vector[ncol] yH;
matrix[ncol, nrow] A;
}

transformed data {
vector[ncol] x;
x = A * yH;
}

parameters {
vector[ncol] yT;
vector[ncol] sigma_x;
vector[ncol] sigma_y;
}

model {
yT ~ normal(yH, sigma_y);
x ~ normal(A * yT, sigma_x);
}"

#Execution Code
library(rstan)
library(MASS)
library(truncnorm)
nrow = 16;
ncol = 24;
sigma_x = matrix(0.2, ncol);
sigma_y = matrix(0.3, ncol);
lambda = matrix(sample.int(100, size = ncol*nrow, replace = T),nrow,ncol);
lambda = lambda - diag(lambda)*diag(x=1, nrow, ncol);
yh = rpois(ncol,lambda) + rtruncnorm(ncol,0,1,mean = 0, sd = 1);
A = matrix(c(1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, #x_A^1 - Done
0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0, #x_A^2 - Done
0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, #x_B^1 - Done
1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0, #x_B^2 - Done
0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, #x_C^1 - Done
0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0, #x_C^2 - Done
0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0, #x_D^1 - Done
0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, #x_D^2 - Done
0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0, #x_E^1 - Done
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0, #x_E^2 - Done
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0, #x_F^1 - Done
0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0, #x_F^2 - Done
0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0, #x_G^2 - Done
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1, #x_G^1 - Done
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0, #x_H^1 - Done
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1), #x_H^2 - Done
nrow, ncol, byrow= TRUE);

truehist(yH, col="#B2001D")
lines(density(yH), col="skyblue", lwd=2)
summary(yH) #summary of quartile results of yh

ret <- stanc(file="NormaLDistribution.stan")
ret_sm <- stan_model(model_code = modelString)
fit <- sampling(ret_sm, warmup=100, iter=600, seed=1, data=list(yH = yh, ncol = ncol, nrow=nrow, A=A))

The error message, which only occurs at the sampling line, is the following:

Error in new_CppObject_xp(fields$.module, fields$.pointer, …) :
variable does not exist; processing stage=data initialization; variable name=N; base type=int
In addition: Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type ‘NULL’
2: In FUN(X[[i]], …) : data with name N is not numeric and not used
failed to create the sampler; sampling not done

Also, I got the graph from this line of code (truehist(yH, col="#B2001D")), but I don’t understand why the curve is broken at a point around 50 on the x-axis. Could you please help me see whether this is due to an error in the code, rather than to the modeling or the data input?

It’s definitely getting much better now (thanks to your great help!!), as I got something (though not yet the posterior distribution of yT|x that I need).
Rplot.pdf (7.2 KB)

I took your code and ran it on my computer. Comments following:

(1) Oops, I defined A weird

You probably want matrix[nrow, ncol] A; not matrix[ncol, nrow] A;

Likewise, x is a column vector of length nrow.

(2) I seriously doubt you want this:

vector[ncol] sigma_x;
vector[ncol] sigma_y;

That is a lot of standard deviations to estimate for the amount of data here. I’m totally guessing, but you probably are looking for something like this:

real<lower=0.0> sigma_x;
real<lower=0.0> sigma_y;

I left out the bounds when I put the model up this morning. My mistake!

(3) I’m not sure what this line does in the R:

ret <- stanc(file="NormaLDistribution.stan")

So I removed it.

I did not end up getting the error you got. Might be worth clearing out your R workspace.

The model that I did end up getting to run spat out this warning: "1: There were 595 divergent transitions after warmup." That basically means the results are meaningless. Now we have an affirmative answer from Stan that the model is messed up in some way.

Step 1 is to go back and think about what it is you’re trying to do from the get go (@sakrejda’s question). Look for examples of other stuff in the manual that sound similar to things you want.

Hope this helps!
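
For what it’s worth, a sketch of the model with (1) and (2) folded in might look like this (just my guess at what you’re after; x is still computed as A * yH, and there are still no priors):

data {
  int ncol;
  int nrow;
  vector[ncol] yH;
  matrix[nrow, ncol] A;   // nrow x ncol, matching the R matrix
}

transformed data {
  vector[nrow] x;         // A * yH has length nrow
  x = A * yH;
}

parameters {
  vector[ncol] yT;
  real<lower=0.0> sigma_x;   // scalar scales instead of one per entry
  real<lower=0.0> sigma_y;
}

model {
  yT ~ normal(yH, sigma_y);
  x ~ normal(A * yT, sigma_x);
}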

By the way, it’s very much a good idea to copy models from other people. Don’t be too focused on fixing whatever it is we cobbled together today. There’s almost certainly something in the Stan manual or one of the Rstan/Rstanarm vignettes that is similar to what you want to do.


Thank you so much for your help! I think I have to define sigma_x and sigma_y as vectors of size ncol x 1 because each entry of the column vectors x and y has a different measurement error, so I need a different variance for each entry. So I keep both of them the same as above.

Now, my current workable execution code is follows:

truehist(yH, col="#B2001D")
lines(density(yH), col="skyblue", lwd=2)
summary(yH) #summary of quartile results of yh
ret_sm <- stan_model(model_code = modelString) # Compile Stan code
fit <- sampling(ret_sm, warmup=100, iter=600, seed=1, data=list(yH = yH, ncol = ncol, nrow=nrow, A=A))

When I run all the lines above, it does not show any errors. But then I could not really observe two things:

  1. Is the fit created by MCMC sampling, or by some other sampling technique? I want MCMC sampling :p
  2. I am completely clueless about this fit variable, so I decided to plot it to see what it represents and whether I can make any useful inferences from it. I tried each of these two lines, but both failed!

stan_trace(fit, inc_warmup = TRUE)
stan_hist(fit)

The error message is: “Error in sampler$call_sampler(args_list[[i]]) : Initialization failed.” “error occurred during calling the sampler; sampling not done”.

But then I had another thought: why do I want to plot yH rather than yT, which is the variable whose distribution I need? So I replaced yH with yT everywhere in the working execution code, and now it fails with the error message: Error in truehist(yT, col = "#B2001D") : object 'yT' not found. I really don’t understand why this fails: if Stan works, why doesn’t it use Bayesian inference to get yT instead of saying yT is not found??? Very upset.

Thank you so much for your advice. I tried very hard to find similar code, but could not find anything close to what I need here (because I have two known parameters, one depending on the other, and those two parameters have different distributions). No code seems to solve this problem, or at least I could not find Bayesian code with >= 2 parameters. Hopeless :(

This is a noisy predictor model and there’s a chapter in the manual on noisy measurement models (it’s also discussed in Gelman et al.'s Bayesian Data Analysis).

To make it a proper Bayesian generative model, you want to turn this around, because what you’re observing is yH, and the true unknown value is yT. So what @bbbales2 wrote as

yT ~ normal(yH, sigma_y);

is traditionally written as

yH ~ normal(yT, sigma_yH);

Same density contribution, so the model doesn’t change, but the notion of what is the likelihood does.
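
To spell out why the density contribution is identical: the normal density is symmetric in its first two arguments, since

1/(sqrt(2*pi)*sigma_yH) * exp(-(yH - yT)^2 / (2*sigma_yH^2)) = 1/(sqrt(2*pi)*sigma_yH) * exp(-(yT - yH)^2 / (2*sigma_yH^2)),

because (yH - yT)^2 = (yT - yH)^2. Only the interpretation changes.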

The other sampling statement also keys off the true value yT, so this one is right as written,

x ~ normal(A * yT, sigma_x);

other than that almost everyone uses x for predictors and y for observations, so it’s really confusing to see it written this way (that’s because math usually treats y as a function of x rather than the other way around).

No way you’re going to be able to fit a different sigma_x[n] for each x[n], especially without a prior. You probably just want a scalar value there. And you are going to want priors in general for all the usual reasons.
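
Concretely, a sketch along those lines might look like the following (the half-normal scales on the sigmas are placeholders, not recommendations; pick priors that match the scale of your data):

data {
  int ncol;
  int nrow;
  vector[ncol] yH;        // observed (historical) measurements
  matrix[nrow, ncol] A;
}

transformed data {
  vector[nrow] x;
  x = A * yH;
}

parameters {
  vector[ncol] yT;        // true, unobserved values
  real<lower=0.0> sigma_x;
  real<lower=0.0> sigma_yH;
}

model {
  sigma_x ~ normal(0, 1);        // placeholder priors on the scales
  sigma_yH ~ normal(0, 1);
  yH ~ normal(yT, sigma_yH);     // measurement model: observed yH given true yT
  x ~ normal(A * yT, sigma_x);
  // yT still has an implicit flat prior here; you will probably want an explicit one too
}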


Thank you so much for your comments, Dr. Carpenter! But why do you say that this model is without a prior? Isn’t (yT|yH) ~ normal(yH, sigma_yH) a prior already? Currently, my input code for sigma_x and sigma_y is:

sigma_x = matrix(0.2, ncol);
sigma_y = matrix(0.3, ncol);

Is this true? If not, why is that?

No priors on any of the parameters: yT, sigma_x, or sigma_y.

You want to sample yH | yT, not the other way around. Then yH gets its sampling distribution conditioned on yT, and yT would have some kind of prior (improper flat in the model as @bbbales2 wrote it).

But the Bayesian math does not make sense with your suggested way of sampling?? Because from Bayes’ rule, we have: p(yT|x) is proportional to p(x|yT)p(yT|yH). Now if we flip p(yT|yH) to p(yH|yT), that makes Bayes’ rule fail :( Also, why do we want to sample data given an unknown, rather than an unknown given data?

I changed it based on your suggestion (please take a look at the code below and help me see what went wrong :()

modelString= "data {
int ncol;
int nrow;
vector[ncol] yH;
matrix[nrow, ncol] A;
}

transformed data {
vector[nrow] x;
x = A * yH;
}

parameters {
vector[ncol] yT;
vector[ncol] sigma_x;
vector[ncol] sigma_y;
}

model {
yH ~ normal(yT, sigma_y);
x ~ normal(A * yT, sigma_x);
}"

#Execution Code
library(rstan)
library(MASS)
library(truncnorm)
nrow = 16;
ncol = 24;
sigma_x = matrix(0.2, ncol);
sigma_y = matrix(0.3, ncol);
lambda = matrix(sample.int(100, size = ncol*nrow, replace = T),nrow,ncol);
lambda = lambda - diag(lambda)*diag(x=1, nrow, ncol);
yH = rpois(ncol,lambda) + rtruncnorm(ncol,0,1,mean = 0, sd = 1);
A = matrix(c(1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, #x_A^1 - Done
0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0, #x_A^2 - Done
0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, #x_B^1 - Done
1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0, #x_B^2 - Done
0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, #x_C^1 - Done
0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0, #x_C^2 - Done
0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0, #x_D^1 - Done
0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, #x_D^2 - Done
0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0, #x_E^1 - Done
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0, #x_E^2 - Done
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0, #x_F^1 - Done
0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0, #x_F^2 - Done
0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0, #x_G^2 - Done
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1, #x_G^1 - Done
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0, #x_H^1 - Done
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1), #x_H^2 - Done
nrow, ncol, byrow= TRUE);

truehist(yT, col="#B2001D")
lines(density(yT), col="skyblue", lwd=2)
summary(yT) #summary of quartile results of yh
ret_sm <- stan_model(model_code = modelString) # Compile Stan code
fit <- sampling(ret_sm, warmup=100, iter=600, seed=1, data=list(yH = yH, ncol = ncol, nrow=nrow, A=A))

The error message I got is simple: Error in truehist(yT, col = "#B2001D") : object 'yT' not found. Why is this the case, if what you thought was correct??

Now, if I only execute these two key lines:

ret_sm <- stan_model(model_code = modelString) # Compile Stan code
fit <- sampling(ret_sm, warmup=100, iter=600, seed=1, data=list(yH = yH, ncol = ncol, nrow=nrow, A=A))

It works perfectly! So I don’t understand why the other three lines failed to execute :((

The "object 'yT' not found" error is coming from the R code.

yT is a parameter in the Stan model, but it is not a variable in the R code. If you want to visualize your inferences on yT, look at the documentation for the RStan function 'extract'. This is the appropriate vignette: https://cran.r-project.org/web/packages/rstan/vignettes/stanfit-objects.html
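
For example, a minimal sketch along those lines (assuming fit is the stanfit object returned by sampling()):

library(rstan)

draws <- extract(fit)       # named list of posterior draws
yT_draws <- draws$yT        # iterations x ncol matrix of draws for yT

# Posterior histogram of the first component of yT
hist(yT_draws[, 1], main = "Posterior of yT[1]", xlab = "yT[1]")

# Posterior summaries (means, quantiles, Rhat) for all components of yT
print(fit, pars = "yT")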

Good luck!
