Simple model Parse Errors

specification

#1

I am a new Stan user and am trying to model a simple problem where two thermometers are measuring a temperature noisily. In addition due to the dynamic range of the thermometers their readings are clipped before being observed (reported).

Here is my attempt which leads to parse errors. Any help will be greatly appreciated.

data {
  int<lower=0> N; // number of samples
  real farenheit_range[2]; //range of the farenheit thermometer
  real celcius_range[2]; //range of the celcius thermometer
  real farenheit_sigma; //error of the farenheit thermometer
  real celcius_sigma;  //error of the celcius thermometer
  real farenheit_reported[N];  //data from the farenheit thermometer
  real celcius_reported[N]; //data from the celcius thermometer
}

parameters {
  real<lower=15,upper=100> true_temp;
  real farenheit_raw[N];
  real celcius_raw[N];
} 

model {
  for(n in 1:N){
    farenheit_raw[n] ~ normal(true_temp * 1.8 + 32, farenheit_sigma);
    celcius_raw[n] ~ normal(true_temp, celcius_sigma);
  }
}

generated quantities{
  for(n in 1:N){
    farenheit_reported[n] = fmin(farenheit_raw[n], farenheit_range[2]);
    farenheit_reported[n] = fmax(farenheit_reported[n], farenheit_range[1]);
    
    celcius_reported[n] = fmin(celcius_raw[n], celcius_range[2]);
    celcius_reported[n] = fmax(celcius_reported[n], celcius_range[1]);
  }
}

The error that I receive is

SYNTAX ERROR, MESSAGE(S) FROM PARSER:

Cannot assign to variable outside of declaration block; left-hand-side variable origin=data
error in ‘model50cf6869e9cc_97838646f0b1cef177fd980329382d24’ at line 27, column 28

25: generated quantities{
26:   for(n in 1:N){
27:     farenheit_reported[n] = fmin(farenheit_raw[n], farenheit_range[2]);
                               ^
28:     farenheit_reported[n] = fmax(farenheit_reported[n], farenheit_range[1]);

PARSER EXPECTED:
Error in stanc(file = file, model_code = model_code, model_name = model_name, :
failed to parse Stan model ‘97838646f0b1cef177fd980329382d24’ due to the above error.


#2

declare the variable farenheit_reported[N] in the generated quantities block.

variables declared at the top-level of a data, transformed data, parameters, and transformed parameters block are global in that they can be referenced from all subsequent program blocks, but you can only assign to a variable in the block in which it is declared.

(note: data and parmeters blocks are declaration only).

this means that in the generated quantities block you can reference data variable N, but in order to assign to your reported variables, they must be declared in that block.


#3

Thanks! Can I still condition on farenheit_reported even though they are in the generated quantities block? I actually tried that but both the MCMC and the VB methods don’t converge to the true_temp parameter even with a lot of observed data.


#4

I’m not sure what you’re suggesting.

you can’t reference variables declared in a later program block in an earlier program block, so no, you can’t set things up that way.


#5

In my model the farenheit_reported and celcius_reported are observed variables, i.e., data. However, as you have recommended their declaration goes in the generated quantities block.

My question is whether I can still pass in the observed values as part of the data argument in the stan() call. And if I do that will my posterior be conditioned on those variables?

Apologies if I am not being very clear and thanks much for your help.


#6

your posterior is determined by the statements in the model block.
to deal with truncated data, you should check the Stan manual.

if your _reported information is coming in as data, then it should be declared in the data block.

the generated quantities block is run after the posterior has been updated. its purpose is to computed quantities of interest based on the posterior.

again, please find examples in the Stan manual or in Stan’s example models on how work with truncated data. I’m sorry I can’t answer this myself very well, as I’m a programmer with very little modeling experience.


#7

Thanks for your help, in any case. The data is not truncated but clipped, i.e., the values outside the range are not thrown away but observed as the edge values.

I don’t quite understand how to incorporate observed data that are transformations of other variables. The few different things I tried (putting it in the model, in transformed params etc.) all gave the same error. I couldn’t find the answers in the Stan manual either.

Do you recommend posting the modeling question in the forum again as another topic?


#8

Those are called “censored”. The terminology is terribly confusing. I go over it in the manual in the chapter “Truncated or Censored Data”. For censoring, you basically just code up a reading of min using the lcdf or the max using the lccdf.

What are you trying to do? You can define transformed data variables that are transforms of the actual data you read in. Or you can define transformed parameters that combine data and parameters (those can also be local variables in the model if you don’t want to save them).


#9

Thanks for the tip. I will read the Stan manual to see if I can follow the examples about using censoring for my problem.

I have a more general question about modeling processes of the following type. Assume

z1 ~ Normal(mu1, sigma1)
z2 ~ Normal(mu2, sigma2)
x = f(z1, z2)

where f is a deterministic function.

If I observe x and am trying to infer E[z1, z2 | x], how do I go about modeling this in Stan?

Looks like I’ll have to put x in the data section, but when I try doing that but declare x = f(z1, z2) in the model section I get the “cannot assign variable outside declaration block” error.


Unable to recover simulated parameters
#10

This is partly a Stan question and partly a stats question.

For Stan, you need to define any derived quantities that depend on parameters in either the transformed parameters block or as a local variable in the model block (or in generated quantities if the log density doesn’t depend on the output).

The stats problem is more subtle and is going to depend on what f is and other information you have in the model. Typically, you need a Jacobian adjustment when changing variables this way and trying to put a distribution on the transform. Here, you need to be able to map from x back to z1 and z2. The manual explains how this works in the chapter on transformed parameters and also the entire chapter on how the internal transform works.


#11

Bob,

Since f is not invertible, how would I specify a map from x to z1 and z2? For a simple case where x = z1^2 + z2^2 the log-likelihood of x is difficult to derive from the fact that z1 and z2 are normally distributed. How does Stan go about performing inference of E[z1|x]? Could you suggest a reference which explains how the log-likelihood calculation works in the cases where Bayesian models have non-bijective transformations of parameters?

I couldn’t find the string “internal transform” in the Stan manual. Do you perhaps mean “inverse transform”?


#12

chapter “Transformations of Constrained Variables”

try rewriting the model in terms of the data-generating process directly, instead of using the function f to recover multiple parameters based on a single number.


#13

Thanks for that.

Unfortunately in my case the data generation process is naturally specified as how I have in the example. My observations are outputs of a deterministic function of latent variables, and I am interested in recovering the latent variables.


#14

The fundamental problem is degrees of freedom. If you have multiple latent variables that are run through a many-to-one function, and then you try to fit with the result, the estimates for the latent variables are not determined. As a simple example, in a + b = 5, the values of a and b are not determined. Same thing in a statistical model if you try to model (a + b) and then use that to recover a and b.

If you don’t have independent information on a and b there’s just nothing you can do. In the statistical setting, you can add priors, but then the scale of a and b is fully determined by the prior.

This is the fundamental reason you can’t compute the Jacobian—it needs to preserve volume (or area in the 2D case). If you go from 2 dimensions (area) down to 1 dimension (line), you no longer have any area. So there’s no Jacobian that can compensate for the change in area.


#15

I am precisely in the setting where I have priors. In your example assume

a ~ Normal(1, sd = 1)
b ~ Normal(0, sd = 2)
x = a + b

Then a, b and x are jointly Gaussian with mean vector (1, 0, 1) and covariance matrix
1 0 1
0 4 4
1 4 5

Even though this matrix is singular, I can still make pairwise inferences. In particular I can calculate E[a | x = 5] = 1/5 * (5 - 1) + 1= 1.8, which is not trivially obvious from the model and the observation that x= 5. (In fact, a|x=5 ~ Normal(1.8, sqrt(.8)))

The above calculation seems straightforward and completely in keeping with the principles of Bayesian reasoning (I hope there are no errors in my calculations above).

What I don’t understand it is what about automated inference in Stan (and other PPLs) makes this hard.


#16

If a, b, and x were jointly Gaussian, then they would have support in all of R^3. But that’s not the case—they only have support where x = a + b. There’s no way to model this in a proper multivariate normal because the covariance matrix isn’t positive definite. You can see the rank degeneracy of your matrix just by inspection of the columsn (the third of which is the sum of the first two). That means you can’t invert it, which means you can’t evaluate the quadratic form inside the definition of the density.

The more general problem this induces is of identifiabiltiy. The data doesn’t determine a or b, just their sum, and there are infinitely many possibilities producing that sum.


#17

Agreed. I did allude to the fact that the covariance matrix is singular in my post. However, as I showed the conditional density of a|x = 5 is definitely not degenerate.

Even though jointly a, b and x are on a plane, i.e., from a degenerate distribution, the priors on a and b do render the posterior of a and b given x = 5 a proper distribution.

As Gelman says in this post http://andrewgelman.com/2014/02/12/think-identifiability-bayesian-inference/ “a Bayesian model is identified if the posterior distribution is proper”


#18

OK, then I think we’re on the same page. The likelihood isn’t classically identified, but with a prior, you get identification. The prior gives you identification even without data. Andrew has several notions of identification just in his book with Jennifer—you need to generalize to Bayes and to practice where you actually care about particular data sets.

To be clearer about what Stan’s trying to do, it unconstrains the parameters to have support on R^N. It assumes that it has support over all that, not just over a lower-dimensional submanifold. Stna will let you build the above model

a ~ foo(...);
b ~ bar(...);
x = a + b;
x ~ baz(...);

The real problem is that in this simple case, a and b are only identified through their priors.

Stan tries to unconstrain the parameters to have support on R^N. We don’t allow hard relations that produce submanifolds as there’s no way to produce a volume-preserving Jacobian. So we could parameterize this in terms of a and b as above and x could be defined as a transformed parameter.


#19

I see. Thanks. I assume baz() in your example is the marginal distribution of x, derived from the fact that x = a+b.

That might work for my problem, where I have an observed quantity that is a deterministic non-linear function of some latent parameters (for which have priors). I will just have to figure out how to derive the marginal distribution of my observed quantity.

ps. I really appreciate your (and Mitzi’s) thoughtful responses to my questions.


#20

That’s how it usually comes up. Stan lets you just define x = a + b as a local variable or transformed parameter if it’s easier to put distributions on a and `b.