Fairly simple hierarchical model with odd Rhat and neff values

I have created a fairly simple hierarchical model for some astronomical data, where I fit the betas in the equation y = beta0 + beta1*x1 + beta2*x2.
The model is as follows (in the code, beta0 is called beta1, beta1 is beta2, and beta2 is beta3):

 data {
      int<lower=0> nobs;
      vector[nobs] y;
      vector[nobs] tau_0;
      //errors
      real tau_1[nobs];
      real tau_2[nobs];
      //observed values
      real x1[nobs];
      real x2[nobs];

    }
transformed data{
        //absolute value
        real var1 [nobs];
        real var2 [nobs];
        vector[nobs] var0;
        for (i in 1:nobs){
                var1[i] = fabs(tau_1[i]);
                var2[i] = fabs(tau_2[i]);
                var0[i] = fabs(tau_0[i]);
                }
        }
parameters {
        //actual values
        real x_real1[nobs];
        real x_real2[nobs];
        //desired parameters
        real beta1;
        real beta2;
        real beta3;    

    }
transformed parameters{
        //model formula
        vector[nobs] muy;
        for (i in 1:nobs) {
                muy[i] = beta1 + beta2 * x_real1[i] + beta3 * x_real2[i];

    }
    }
model {
       x_real1 ~ normal(0, 10);
       x1 ~ normal(x_real1, var1);
       x_real2 ~ normal(-1, 10);
       x2 ~ normal(x_real2, var2);
       y ~ normal(muy, var0);
       }

where I have nobs = 1000.
I run this in PyStan using

fit = pystan.stan(model_code=stan_code, data=toy_data, iter=5000, chains=3,
verbose=False, n_jobs=3)

The Rhat values for the betas are 2.6, 42.83, and 33.03 respectively, with n_eff equal to 2 for all of them, which are clearly worrying values.

Is there something flawed in my Stan model that is causing this?
In case you're interested, this is the entire file that I use to extract data as well:
stan_forum_upload.py (4.7 KB)
The data I use is from the Gaia catalogue of RR Lyrae variables, with x1 being period and x2 being metallicity (similar to what is done in this paper, though with some modifications). The full data file is too big to post here: there are 26,000 rows, but I am only using 1,000 at first to save processing time.

I've been playing around with the model, and when using the plot() method to visualize the posteriors, I get this:


What's interesting is that Beta0 at the bottom, the one that has the lowest Rhat value, also appears to be the only beta that doesn't have two areas of high frequency in the posterior.
But now my question is: what do the two modes in beta1 and beta2 represent? Is this still something wrong in my model, or is it something fundamental in the data?

That's probably an identification/multi-modality issue. I'll have a closer look - maybe later today, or tomorrow, okay?


Hi Sam,

Admittedly, I have never used Python before, but before interpreting the sampling traces, I would look at the model code...
If you want to estimate beta0, beta1 and beta2 hierarchically, this requires sampling these values from hyper priors/posteriors, with a mean (e.g. mu_beta0) and a standard deviation (e.g. sd_beta0). But what you actually seem to have done (which I might be misinterpreting due to 'language issues') is:

model {
       x_real1 ~ normal(0, 10);
       x1 ~ normal(x_real1, var1);
       x_real2 ~ normal(-1, 10);
       x2 ~ normal(x_real2, var2);
       y ~ normal(muy, var0);
       }

... sampling the - real x values??? :)
This looks really wrong :))) (no offense)
In the coding languages I know, it should look like this, starting with these hierarchical parameter priors:

model {
       var0 ~ normal(1, var0);  // where var0 is the estimated standard deviation for y (which might be better estimated differently...)
       mu_beta0 ~ normal(1, 10);
       mu_beta1 ~ normal(1, 10);
       mu_beta2 ~ normal(1, 10);
       sd_beta0 ~ normal(1, 10);
       sd_beta1 ~ normal(1, 10);
       sd_beta2 ~ normal(1, 10);
       }

Then you give these priors to the 'individual level' for sampling from the population of parameters:

model {  // continuing the same model block
for (i in 1:nobs) {
     i_beta0[i] ~ normal(mu_beta0, sd_beta0);
     i_beta1[i] ~ normal(mu_beta1, sd_beta1);
     i_beta2[i] ~ normal(mu_beta2, sd_beta2);

     muy[i] = i_beta0[i] + i_beta1[i] * x_real1[i] + i_beta2[i] * x_real2[i];

     y[i] ~ normal(muy[i], var0);
     }
}

Where y is the value you finally predict.

In words: for each data point you sample the parameter in question (not the observed data 'real_x'), which introduces uncertainty in the parameter (not the observed data). You can still (later) include estimates of the variance of the data points (e.g. x1) themselves, for other purposes, but that is a rather different question from estimating the beta parameters in a hierarchical regression model.
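
For reference, reading the two fragments above as one program, a self-contained sketch of this hierarchical structure could look like the following (the declarations and the specific priors are my own filling-in, and the measurement error on x1/x2 is left out for simplicity):

data {
  int<lower=0> nobs;
  vector[nobs] y;
  vector[nobs] x1;
  vector[nobs] x2;
}
parameters {
  real mu_beta0;
  real mu_beta1;
  real mu_beta2;
  real<lower=0> sd_beta0;
  real<lower=0> sd_beta1;
  real<lower=0> sd_beta2;
  vector[nobs] i_beta0;
  vector[nobs] i_beta1;
  vector[nobs] i_beta2;
  real<lower=0> sigma_y;
}
model {
  // hyper priors on the population-level means and spreads
  mu_beta0 ~ normal(0, 10);
  mu_beta1 ~ normal(0, 10);
  mu_beta2 ~ normal(0, 10);
  sd_beta0 ~ normal(0, 10);  // half-normal because of the <lower=0> constraint
  sd_beta1 ~ normal(0, 10);
  sd_beta2 ~ normal(0, 10);
  sigma_y ~ exponential(1);

  // individual-level coefficients drawn from the population distribution
  i_beta0 ~ normal(mu_beta0, sd_beta0);
  i_beta1 ~ normal(mu_beta1, sd_beta1);
  i_beta2 ~ normal(mu_beta2, sd_beta2);

  // likelihood
  y ~ normal(i_beta0 + i_beta1 .* x1 + i_beta2 .* x2, sigma_y);
}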

Hope this helps,
Best, René

Ps: Maybe I miss the point, because you introduce your model as a multivariate model, which it is not:
multivariate would be predicting several y-variables. So it seems things are getting mixed up here.

PPs: I looked into your code again, and I still think (although I have never used this language) that there are priors missing for the parameters you want to estimate (beta0...), which looks really odd to me. Also, I would suggest checking how the residuals of a linear model are estimated. I think just freely estimating them while fitting predictions on y is not entirely appropriate; it might be better anchored on data.


Ok, I think I see. Sorry, I'm quite new to Stan and I can't believe I forgot priors for beta (I feel like I had them at some point and maybe accidentally deleted them).

The var1 and var2 are the errors on the x values as supplied in the catalogue. The "true" value is treated as an unknown, with the observed values (x1 and x2) treated as knowns from the data. It would bias the estimate of the parameters if this error weren't included. The same goes for the error on y, which is known from the catalogue. I'm basing this off of https://mc-stan.org/docs/2_18/stan-users-guide/bayesian-measurement-error-model.html, but I called the true x "x_real" and the observed x value simply "x". Sorry if the naming caused confusion. I'll rename them when I edit the code again.
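
For context, the single-predictor structure described in that chapter looks roughly like this (a sketch; the variable names are mine, not the guide's):

data {
  int<lower=0> N;
  vector[N] x_obs;            // observed predictor
  vector<lower=0>[N] tau_x;   // known measurement error (sd) for each observation
  vector[N] y;
}
parameters {
  vector[N] x_real;           // latent "true" predictor values
  real beta0;
  real beta1;
  real<lower=0> sigma;
}
model {
  x_real ~ normal(0, 10);          // prior on the true values
  x_obs ~ normal(x_real, tau_x);   // measurement model: observed = true + noise
  beta0 ~ normal(0, 10);
  beta1 ~ normal(0, 10);
  sigma ~ exponential(1);
  y ~ normal(beta0 + beta1 * x_real, sigma);  // regression on the true values
}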

Sorry about the mixup over the multivariate name. I changed it, but I guess I wasn't fully aware of what multivariate meant!

I'll run the model and see what happens (it's taking a while with the priors on beta, I'll update this with what happens).

One followup question:
Why did you choose to put the mean-for-y expression (muy[i]) in the model block and not the transformed parameters section? Is that where it is supposed to go?
Thanks!

Update with values:
So with these updates, the Rhat value is a bit better (but still bad), but the plots still look like this:

It's more of a measurement error model, I think. And sometimes multivariate refers to multiple explanatory variables - which can be confusing.

Good priors are definitely a start, but I think the fundamental multi-modality issue won't be solved. It's a bit too late for me to dig into this model more today, but I will do so tomorrow.

@Sam-2727 can you post a subset of your data here? The data that goes into Stan. I can't replicate your python code (I'm using R).

I'll post the csv file tomorrow once I make the subset.

Yes. Convenience... :) (I usually separate those as well).
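
To spell that out: anything declared in transformed parameters is saved with every draw, while the same quantity can instead be computed as a local variable inside the model block and not stored. A sketch of the latter, reusing the names from the thread:

model {
  vector[nobs] muy;   // local variable: used for sampling, but not saved in the output
  for (i in 1:nobs)
    muy[i] = beta1 + beta2 * x_real1[i] + beta3 * x_real2[i];
  y ~ normal(muy, var0);
  // ... priors as before ...
}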

Okay, now if you are convinced that your model makes sense on a formal level, then looking at the last trace plots you provide reveals three constellations of 'signs' of parameter estimates (in this trace-temporal order):

  1. (+) beta0 (+) beta1 (+) beta2
  2. (-) beta0 (-) beta1 (-) beta2
  3. (-) beta0 (+) beta1 (+) beta2

If you put this into an actual equation you get (with rounded beta values):

1') y = .1 + .07x1 + .1x2
2') y = -.1 - .1x1 - .2x2
3') y = -.1 + .05x1 + .1x2

If you only look at 2) and 3), it is most obvious what happens.
Namely, while beta0 is unchanged, the same y can be predicted by changing the signs of beta1 and beta2. This immediately means (in classical phrasing) that if you took only one of these solutions (e.g. model 2), then the residuals of that model would not be normally distributed, which violates the most basic assumption of regression models (i.e. independent errors).
Independent errors (residuals) are very important because if they are not independent, then there is systematic variance in the data which the model does not capture, which makes the model results uninterpretable.
Equations 2) and 3) also point towards a likely cause: if both of these formulas are likely, then the y-variable seems to be "bimodal". Or there seems to be a third variable which moderates the direction of the effect (positive or negative).

But a standard suggestion would also be to check whether x1 and x2 actually correlate, which would be a substantial problem, as this causes model identification issues (usually discussed in terms of regression 'tolerance').

Hope this helps,
Best, René

Ps:
On the x1/x2 estimation-and-bias issue.
I can see that it is a reasonable question whether the measurement itself has a specific error (and how much), but I cannot see how this question, in the given model, differs from saying that the parameters (beta1, beta2) vary with some error. And this might indeed be the source of the problem, I think:

For instance,
let x1 be fixed on the actual measurement, and let beta1 randomly vary around the 'overall/true' effect (which is the common definition of a hierarchical model);
compare this with:
let x1' be estimated varying around the actual measurement and let beta1' be fixed.

Both definitions are formally identical when explaining variance in y:
y = x1 * beta1'
y = x1' * beta1
They just scale differently.
And if you finally estimate both variables/parameters,
y = x1' * beta1',
they become unidentifiable, because large estimates of x1' compensate for small estimates of beta1', and vice versa.
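
To state that compactly (my notation, and ignoring for a moment that x_1' is anchored to the observed data): for any constant c \neq 0,

\mu_y = \beta_1' \, x_1' = \left(\tfrac{\beta_1'}{c}\right)\left(c \, x_1'\right),

so any rescaling of the latent predictor can be absorbed by rescaling its coefficient, and the likelihood alone cannot tell the two apart.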

Does this sound plausible for your case? (I am still having issues reading the model code, but it looks like a problem.)

Here's the csv file:
results_csvexport11.csv (12.1 KB)

I've left it with the variable names present in the Stan model. The errors on y and y itself are from my calculations, but I've reviewed (and revised a bit) the code for those and they seem pretty solid, so I'm pretty sure the problem lies either in my model or in the data. This is just 100 observations; tell me if it's better if I post more.

@ReneTwo, I'll measure the correlation between the variables and will get back to you on that (although past research suggests that there should be a correlation, see this paper, which I'm essentially replicating with a different data set).

For the errors, I think you're mostly correct that there really isn't a difference between errors on beta and on x, except that the x errors are machine reported, so they represent "accounted for" errors that are known. Putting the errors on beta is just to account for any other unaccounted-for errors in the model. Because the errors are still Gaussian, I don't think this should introduce any bias or make the relationship harder to identify. I could be totally wrong in this reasoning, but after all, https://mc-stan.org/docs/2_18/stan-users-guide/bayesian-measurement-error-model.html uses priors on beta and x as well.

Update: I revised the values of y a bit after noticing a slight calculation error on my part.

If you're thinking of running this model, here's the current Stan code I am using:

data {
      int<lower=0> nobs;
      vector[nobs] y;
      vector[nobs] tau_0;
      //errors
      real tau_1[nobs];
      real tau_2[nobs];
      //observed values
      real x1[nobs];
      real x2[nobs];

    }
transformed data{
        //absolute value
        real var1 [nobs];
        real var2 [nobs];
        vector[nobs] var0;
        for (i in 1:nobs){
                var1[i] = fabs(tau_1[i]);
                var2[i] = fabs(tau_2[i]);
                var0[i] = fabs(tau_0[i]);
                }
        }
parameters {
        //actual values
        real x_real1[nobs];
        real x_real2[nobs];
        //desired parameters
        real mu_beta0;
        real mu_beta1;
        real mu_beta2;
        real i_beta0 [nobs];
        real i_beta1 [nobs];
        real i_beta2 [nobs];
        //real<lower=0> sd_beta0 [nobs];
        //real<lower=0> sd_beta1 [nobs];
        //real<lower=0> sd_beta2 [nobs];

    }
transformed parameters{
        //model formula
        vector[nobs] muy;
        for (i in 1:nobs) {
                muy[i] = i_beta0[i] + i_beta1[i] * x_real1[i] + i_beta2[i] * x_real2[i];

    }
    }
model {
       x_real1 ~ normal(0, 10);
       x1 ~ normal(x_real1, var1);
       x_real2 ~ normal(-1, 10);
       x2 ~ normal(x_real2, var2);
       y ~ normal(muy, var0);
       var0 ~ normal(1, var0);  // where var0 is the estimated standard deviation for y (which might be better estimated differently...)
       mu_beta0 ~ normal(1,10);
       mu_beta1 ~ normal(1,10);
       mu_beta2 ~ normal(1,10);
       //sd_beta0 ~ normal(1,10);
       //sd_beta1 ~ normal(1,10);
       //sd_beta2 ~ normal(1,10);
       for (i in 1:nobs){
                i_beta0[i] ~ normal(mu_beta0, 3);
                i_beta1[i] ~ normal(mu_beta1, 3);
                i_beta2[i] ~ normal(mu_beta2, 3);
                }
       }

(I took out the priors on the standard deviation of beta for now because they make the model run basically 100 times slower, though I can add them back in if you think they are necessary.) This model is coming out with disastrous NaN values for Rhat and an n_eff of only 2 for each parameter.


Thanks for uploading the data! :)

Well, let's stick to your first model for now - I'm actually not quite sure what you want to achieve with these other models, but let's start simple. Maybe we can add more complexity later.

So, just to be explicit:

\begin{aligned} x_1^* &\sim \text{Normal}(x_1,\tau_1) \\ x_2^* &\sim \text{Normal}(x_2,\tau_2) \\ y &\sim \text{Normal}(\beta_0 + \beta_1 x_1^* + \beta_2 x_2^*,\tau_0), \end{aligned}

where x_1,x_2,y, and \tau_0,\tau_1,\tau_2 are observed. The predictors x_1^*,x_2^* are inferred, i.e. they're assumed to be normally distributed around the observed x with a random error with standard deviation \tau. The outcome is assumed to be normally distributed with known standard deviation. The regression coefficients \beta_0,\beta_1,\beta_2 are pooled (i.e. not hierarchical).

That's basically the model that you posted in your initial post, right? I re-wrote the code slightly (mostly vectorization and a non-centered parameterization of x_1^*,x_2^*). I also dropped the transformed data block, since \tau > 0 for all observed \tau anyway (and even if not, this will be checked by the <lower=0> restriction in the data block). I also changed the priors on beta, because... idk, they were not that informative anyway, but you can of course change them back/adjust.

This is the model:

data {
  int<lower=0> nobs;
  vector[nobs] y;
  
  vector<lower=0>[nobs] tau0;
  vector<lower=0>[nobs] tau1;
  vector<lower=0>[nobs] tau2;

  vector[nobs] x1;
  vector[nobs] x2;
  
}
parameters {
  vector[nobs] x1_tilde;
  vector[nobs] x2_tilde;

  real beta0;
  real beta1;
  real beta2;
}
transformed parameters{
  vector[nobs] x1_ast = x1 + (tau1 .* x1_tilde);
  vector[nobs] x2_ast = x2 + (tau2 .* x2_tilde);
  vector[nobs] mu = beta0 + beta1*x1_ast + beta2*x2_ast;
}
model {
  x1_tilde ~ std_normal();
  x2_tilde ~ std_normal();
  
  y ~ normal(mu, tau0);
  
  beta0 ~ normal(0,10);
  beta1 ~ normal(0,10);
  beta2 ~ normal(0,10);
}

This runs relatively smoothly.

> rstan::get_elapsed_time(posterior)
        warmup sample
chain:1  9.836 10.381
chain:2 10.786 10.072
chain:3  9.543 10.104
chain:4  9.773 10.116

Results (excluding x1_tilde, x2_tilde, x1_ast, x2_ast, mu) are:

> print(posterior, c("x1_tilde", "x2_tilde", "x1_ast", "x2_ast", "mu"), include = FALSE)
Inference for Stan model: stan_forum_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
beta0   -4.62    0.04  0.58   -5.86   -4.97   -4.60   -4.23   -3.53   246 1.01
beta1   -3.26    0.16  2.17   -7.45   -4.73   -3.23   -1.83    0.97   189 1.02
beta2   -7.88    0.03  0.56   -9.07   -8.21   -7.85   -7.50   -6.90   327 1.01
lp__  -448.76    0.49 12.02 -473.13 -456.93 -448.51 -440.42 -426.26   592 1.00

(I have no clue if these values make sense. Sorry, I don't really understand the science here...)

Diagnostics look fine, but n_eff is fairly low for the betas. I'm guessing that making the model more complex will be difficult. Do it step by step and try to see where (if at all) stuff breaks.

That being said... I tried to run the model with \tau_0 as a parameter, with prior \tau_0 \sim \text{Exponential}(1), and it ran much better... Here's the model code and results:

data {
  int<lower=0> nobs;
  vector[nobs] y;
  
  vector<lower=0>[nobs] tau_1;
  vector<lower=0>[nobs] tau_2;
  
  vector[nobs] x1;
  vector[nobs] x2;
  
}
parameters {
  vector[nobs] x1_tilde;
  vector[nobs] x2_tilde;
  
  real<lower=0> tau_0;
  
  real beta0;
  real beta1;
  real beta2;
}
transformed parameters{
  vector[nobs] x1_ast = x1 + (tau_1 .* x1_tilde);
  vector[nobs] x2_ast = x2 + (tau_2 .* x2_tilde);
  vector[nobs] mu = beta0 + beta1*x1_ast + beta2*x2_ast;
}
model {
  x1_tilde ~ std_normal();
  x2_tilde ~ std_normal();
  
  y ~ normal(mu, tau_0);
  
  tau_0 ~ exponential(1);
  
  beta0 ~ normal(0,10);
  beta1 ~ normal(0,10);
  beta2 ~ normal(0,10);
}

> get_elapsed_time(posterior_2)
        warmup sample
chain:1  0.876  0.553
chain:2  0.909  0.629
chain:3  0.976  0.420
chain:4  0.923  0.382
> print(posterior_2, c("x1_tilde", "x2_tilde", "x1_ast", "x2_ast", "mu"), include = FALSE)
Inference for Stan model: stan_forum_model-2.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
tau_0    2.07    0.00  0.15    1.80    1.96    2.06    2.16    2.40  3811    1
beta0    0.11    0.01  0.57   -0.98   -0.27    0.11    0.50    1.24  1973    1
beta1   -7.99    0.05  2.53  -12.97   -9.70   -8.00   -6.27   -2.99  2453    1
beta2   -0.19    0.01  0.31   -0.82   -0.39   -0.21    0.01    0.40  1064    1
lp__  -224.71    0.27 10.21 -245.76 -231.34 -224.32 -217.46 -205.79  1445    1

Betas are all different now... Hmm... I don't know what's up here, but the estimated \tau_0 is much higher than the "observed" values for \tau_0 in the first model.

Anyways... I hope that gives you some starting point. Feel free to ask, if anything in the code is unclear. :)

Cheers!

Ok, the paper you refer to gives this model...

It looks like you want to estimate the grey circles. And it also looks like you want to implement what they do in equation 6 (if so, then there seems to be a log-link missing for x1 in your code, as well as its joint prior with omega-dash, which they say should control for some mysterious correlation of parameters :))

Anyway, it seems that there is a degree of complexity which is necessary to prevent the things you plot from happening... and I think this goes beyond my scientific insight.

Just some things I noticed in the data you provide which I would worry about:
tau1 is basically zero, up to the 5th decimal place; I would consider this as absent variance (but I am not an astronomer) :))
tau1 and tau2 both correlate with y (meaning the bigger the measurement uncertainty, the bigger the y values). This should not be the case, as errors are always assumed to be independent from what the model tries to predict. Since the y-values are not truncated or censored at 0 (which could be one explanation), this is tough to handle, I think.

Best, René

Hey! Good questions! I'm afraid I won't be able to give very good answers – I'm by far not an expert on these kinds of models, but let's see...

x + (tau .* x_tilde) is just \mathbf{x} + \mathbf{\tau} \circ \mathbf{\tilde{x}}, where \circ is the element-wise product. Really, this is just the vectorized form of x + \tau\tilde{x}: when \tilde{x} \sim \text{N}(0,1) and \tau > 0, then x^* = x + \tau\tilde{x} implies that x^* \sim \text{N}(x,\tau^2), where x is the mean and \tau^2 is the variance. These are mathematically equivalent, but geometrically different, and thus this non-centered parameterization is sometimes preferable for HMC.
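
For contrast, the centered version of the same prior, which the code above avoids, would be the direct statement (just a fragment, not a full program):

parameters {
  vector[nobs] x1_ast;         // centered: the "true" values are sampled directly
}
model {
  x1_ast ~ normal(x1, tau1);   // same distribution as x1 + tau1 .* x1_tilde with x1_tilde ~ std_normal()
}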

Where does tau1 come from? How can one measure an error or deviation term for just one data point? And how should one be able to estimate the true x in reverse for each data point? (Maybe this is just an issue that stems from me not understanding Stan code properly; if so, my apologies, but this either looks over- or under-specified.)

As I understand it, x is some measurement with known measurement error \tau. The real value, which I denoted x^*, is unknown, but assuming that the measurement error is normally distributed, we can infer plausible values and work with those.

Why and how \tau is known is more of a scientific (in the stricter sense) question than a statistical one, I think.

Given you have tau, this implies that x1 was 'cleansed' by tau1 beforehand (or tau1 was somehow relatedly estimated as a -potential or expected- residual of some form). Then it seems that adding some sampling error does not 'cleanse' x1, does it? Why should one want to formally 'add' an expected error to x1 again afterwards? (Shouldn't that simply be a matter of the error in the prediction, not in the value?) It depends on how to understand tau now: if it is an actually measured error taken out of x, then x1 and x2 should be more errorless (less biased) by definition, no? :)) And if it is an otherwise estimated error or standard deviation of errors, then why is there a different tau for every single data point? Does this mean that each x1 and x2 data point is actually the 'mean' of a distribution? But this could be me. I think Sam's first approach seems a bit like one would take the uncertainty in the measurement into account, like in a latent common cause model, fair enough. But what do you expect to happen if:

The idea is not to get rid of the measurement error, but to incorporate it. Ultimately, we don't know the real values, so we should take measurement error into account. Not putting it in could lead to biased results. Remember that usually in a regression you take the data (x) as fixed, meaning you are certain that they are "true", while this approach introduces uncertainty in the predictors.
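
For the simplest case, a single predictor with classical, independent measurement error of variance \tau^2, the bias even has a well-known direction: ignoring the error shrinks the estimated slope toward zero by the reliability ratio,

\hat{\beta}_{1,\text{naive}} \approx \beta_1 \cdot \dfrac{\sigma_{x^*}^2}{\sigma_{x^*}^2 + \tau^2},

although with several correlated predictors the distortion can go in either direction.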

Now, one could think about uncertainty about the observed measurement errors, but it's easier to start simple – and I don't know nearly enough about Astronomy to make judgments on that.

More importantly: tau1, tau2 and y actually positively correlate (at about r = .3 each), while x1 and x2 negatively correlate with y (at about -.3 and -.1). (x2 and tau2 also seem to correlate negatively.) This means: if my interpretation of the formula above by Max (or of the initial code by Sam) is in the direction of correct, then adding 'more widely distributed error' should lead to biased samples of the x values, not to a reduction of it...? I would suggest thinking this through again, but it seems not surprising to me that the model cannot decide between positive and negative beta1 and beta2 values, although the complex statistical mechanics might require some more thinking.

Hm... as I said above, it's more about accounting for error (in the sense of incorporating it), and not reducing it. Sure, estimation variance likely increases, but it's likely less biased. Also, remember that a (linear) regression coefficient is essentially \beta x = \rho\dfrac{\sigma_y}{\sigma_x}(x-\mu_x), but we usually don't estimate these quantities. Usually we know x, but now we know \mu_x (denoted x in the code) and \sigma_x (that's tau). (I just saw that you can also put the x_tilde into the regression equation to get standardized regression coefficients... interesting.)

Again, how we know these in this case is a question that I can't answer. And I actually also don't know if all this is appropriate in this application.

What I wanted to contribute here is that this type of stuff usually runs better with non-centered parametrization (see above). Interesting questions, nonetheless. :)

EDIT: oh, you edited.

Yeah, I myself was not too convinced that these were good questions :)). But your answers are ;).

Ok. Does this mean that the 'true' x1 and x2 values are actually estimated for each observation i (in N stars) when defining them with [nobs]?

vector[nobs] x1_tilde;

which then translates to something like this loop (shown for x1 for illustration)?

 for (n in 1:N)
    y[n] ~ normal(beta0 + beta1 * (x1[n] + tau_1[n] * x1_tilde[n]), tau_0[n]);

(I actually learn Stan on the fly while writing this... nice)... :)

The graphical model above would require this, but also, following the (more complex/detailed) paper the authors cite, they also provide hyper priors for the 'hyper-true...' x1 estimates... (if the mentioned equation 6 is the actual target formula); with this prior...

where k denotes the Dirichlet components in a mixture model... (really...), which looks 'incredibly important', because I would have no idea why one would come up with such complex priors (but I am still no astronomer) if 'free' estimation were able to do the job, so all this might play into the identification issues here.
What a complex model to start with... Sam. :)

Best, René

Ps. My main 'problem understanding this' approach is... whether one can call x_tilde1 really an estimate, or whether one should call it part estimate, part simulation. The paper authors mention something like this. I am a psychologist; I would never dream of inviting a participant to a study, taking one target measure, while also asking the participant how certain this measure might be, in order to eventually simulate the range of likely target measures. Usually I measure the target variable multiple times, and then infer both its mean and its standard deviation from this, which I would then call a likelihood estimation. (Now I think I should once try to take the estimated mean and standard deviation as single values, to simulate the range of likely parameters... :); I wonder whether one could go straight to the 'more basic' data, if there are some, underlying the single x1 and tau observations?) And I would always prefer this if I can, because when "simulating" (uncertainty in) this one target measure + deviation range, the model always has some degrees of freedom to come up with (unlikely) values, not because they are close to the one measured x1, but because they fit the formula in some other way (depending on the other simulated values of e.g. x2). And I think this is also the reason why Sam's model yields these weird posteriors, when not "nailing down" all these parameter interdependencies within this complex model architecture.

@Max_Mantei, this makes sense. Thank you so much for fixing this problem. I would've never thought of vectorizing the equation.

And that is correct, tau is simply measurement error, inherent to the Gaia satellite taking these measurements. The error has been calculated previously, so it is effectively just another prior. I think maybe in more human sciences, it is hard for error to be quantified, so it usually isn't included.

Also @ReneTwo, I think you're right about that part of the function being responsible for the lopsided posteriors. I'll look into creating the Gaussian mixture component, but it's a bit out of my range of Stan knowledge. I think when I started creating this I wanted to do simple models first, but I'll try to include it (and maybe start a new thread if there are any problems with it). There's another variable I'll also try to include in the relation that might explain the dependences amongst parameters, but we'll see...

Vectorizing just made it run a bit faster. The non-centered parameterization made it run well. There is a fair bit about re-parameterization in the user's guide, and there is also a great case study by Michael Betancourt.

I think maybe in more human sciences, it is hard for error to be quantified, so it usually isn't included.

This! I'm coming from Economics/Pol Sci. and there is so much measurement error everywhere. The thing is that we often have no clue what it could look like - so much that even the thought of possibly known measurement error (from instruments, sensors, etc.) kinda seems outlandish to us (at least to me). Well, this is astronomy, so outlandish fits.

Now that I see the model from that paper, it really is a multivariate model! My guess would be that getting the MultiNormal to fit will be easier than getting the mixture to work, but I could be wrong. On the other hand, it's really just a two-component mixture (k = 1, 2) if I'm not mistaken, so that's actually manageable (search for log_mix in the user's guide). A Dirichlet prior on a two-component mixture would actually be just a Beta prior on phi. So that's that.
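
A minimal, generic two-component Normal mixture using log_mix, with a Beta prior on the mixing weight, would look something like this (just a sketch of the construct, not the model from the paper):

data {
  int<lower=0> nobs;
  vector[nobs] y;
}
parameters {
  real<lower=0, upper=1> phi;   // mixing weight; a Beta prior is the 2-component Dirichlet
  ordered[2] mu;                // component means, ordered to help identifiability
  vector<lower=0>[2] sigma;     // component scales
}
model {
  phi ~ beta(2, 2);
  mu ~ normal(0, 10);
  sigma ~ exponential(1);
  for (n in 1:nobs)
    target += log_mix(phi,
                      normal_lpdf(y[n] | mu[1], sigma[1]),
                      normal_lpdf(y[n] | mu[2], sigma[2]));
}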

@Max_Mantei
You actually sound like a statistician (no offense, ... a good one I mean) when speaking about errors. Also ..., you just made me appreciate 'dummy' over 'contrast coding' of experimental designs for the first time ... (And in psychology, I have never seen somebody thinking about something like a Gaussian mixture prior with even one component... ;) ...maybe once...)

@Sam-2727
Error usually -is- included (in Psychology), but mostly in the conclusions... (Look at the recent large-scale replication attempts of important empirical effects... it looks bad so far; same for medicine and physics...)
Back to the topic :) I think this thread is already a good start. I guess one gets used to the domain-specific model assumptions. And understanding theories always means understanding their constraints (i.e. where they end), in any domain, which is basically equivalent to being able to define meaningful priors.

Have a nice weekend.


I chuckled a bit. :)