Complicated Stan model with measurement error in x and y fitting poorly (radically edited with concrete running code)

I have rewritten my question to be more concrete. The goal is to model simple “factor exposures” in a way that accounts for systematic measurement error in both x and y. At a high level, the market return is an average of the returns of its constituents. A general aim in finance is to find the exposure of some factor to the market’s moves; that is, to find the beta such that factor = beta * mkt + noise. I want to introduce the complexity that we know factor and mkt are mismeasured during certain periods of time, and I want to account for that mismeasurement.

I think I have the right idea for a Bayesian model. Suppose the market is partitioned into different quality instruments, so that each factor is indexed by q. We observe:

matrix[Q, T] X; // returns across time, grouped by quality. There are Q qualities and T times.
vector[T] mkt; // market returns
vector[T] u_t; // vector encapsulating the scale of the error thought to be present in the market. Near zero
// for most times, around 1 or 2 for the financial crisis and similar periods.
matrix[Q, T] wts; // a matrix of weights allowing one to recover mkt from X.

Define a function new_mkt(X, wts) which computes the market return given the factor returns, across times. The weights contain information about how each factor should be weighted (by size).

The informal model is:

X_true[q, 1:T] ~ N(0, sigma_1);
X[q, 1:T] ~ N(X_true, .000001 * g(u_t)); // g is some linear transformation of u_t. For now take it to be
// the constant function g = 1. I am unable to make even this
// work yet.
mkt_true = new_mkt(X_true, wts);
X_true[q, 1:T] ~ normal( beta[q] * mkt_true, sigma);

This seems to achieve what I want in that it properly relates the uncertainty in each factor to the uncertainty of the true market return, and ultimately defines the sought-after beta to be the relationship between these uncertain factors. Note that for now I am choosing to have a model for a very, very tiny measurement error. If I can get this model to work I then want to let g(u_t) allow the measurement error to be big only for a few select times.

The formal model I am running is the following:

functions {
  // Reconstruct the market return at each time point as the average of the
  // weighted quality returns at that time.
  vector new_mkt(matrix quals, matrix wts) {
    vector[cols(quals)] nmkt;
    matrix[rows(quals), cols(quals)] mod_quals;
    mod_quals = quals .* wts;
    for (t in 1:cols(mod_quals)) {
      nmkt[t] = mean(col(mod_quals, t));
    }
    return nmkt;
  }
}

data {
  int<lower=1> TT;      // number of stacked returns
  int<lower=1> Q;      // number of qualities
  matrix[Q, TT] X;  // quality returns
  vector[TT] mkt;    // market return
  vector[TT] s_t;     // measure of measurement uncertainty
  matrix[Q, TT] wts;    // defines exact relationship of mkt as weighted average of X
}

parameters {
  vector [Q] beta;
  real<lower=0> sigma;
  matrix[Q, TT] X_true;     // unknown true factor returns
}

transformed parameters {
  vector[TT] mkt_true;      // unknown true market return, a simple function/transformation of X_true
  mkt_true = new_mkt(X_true, wts);
}

model {
  beta ~ normal(1,1);
  sigma ~ normal(0, .1);

  for(i in 1:Q) {
    X_true[i, 1:TT] ~ normal(0, 1);
  }

  for(i in 1:Q) for(t in 1:TT) {
    X[i, t] ~ normal(X_true[i, t], .0000000001);   // for now, constraining X to be a nearly exact measurement
  }

  for(i in 1:Q) for(t in 1:TT) {
    X[i, t] ~ normal(beta[i] * mkt_true[t], sigma);
  }

}

Diagnostics

This model runs, but it yields estimates for beta that are wrong (judged against simpler models that assume zero measurement error), with extremely high Rhat values for the beta terms and for sigma. Something is going on. Am I doing something strange, or is this a genuinely difficult model for Stan?

This term: X[i, t] ~ normal(X_true[i, t], .0000000001); can make it difficult for the sampler to move around. Having lots of different parameters on lots of different scales will give the sampler trouble, so if these parameters are really tiny and other ones are large it can be hard for it to adapt. Maybe just make that term less constraint-like? Like put a standard deviation of 1 and see if life is better?
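That is, something along these lines, purely as a diagnostic:

X[i, t] ~ normal(X_true[i, t], 1);   // much looser than the near-zero scale; just to see whether sampling improves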

The thing that keeps holding me up on this is that in the informal model you have two likelihood terms for X_true. In the Stan code you have two separate terms for X. Doesn’t seem right, either way.

It seems that you’d write this model instead with something like:

X_true ~ normal(0, 1);
X ~ normal(X_true, measurement_precision);
mkt ~ normal(some_magic_function(beta * X_true), sigma);

edit: but I’m pretty hesitant on the prior on X_true. Seems kinda bold in this context

Thanks for your comment, but I think your suggestion doesn’t capture the fact that the measurement precision is a function of time (or more generally, if you like, a function of the index of the observation). Most observations have zero, or near zero, measurement error. I just want the model to capture the effect of measurement error at these few points. So while I did write X[i, t] ~ normal(X_true[i, t], .0000000001), what I want to move to is X[i, t] ~ normal(X_true[i, t], s_t[t]), where s_t is extremely small for most times/observations, but potentially large for several key times where we suspect measurement error.
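Concretely, the likelihood I am aiming for is the sketch below, keeping s_t as data. (One detail: normal() needs a strictly positive scale, so the “zero error” dates would have to carry some tiny positive value rather than an exact zero.)

for (i in 1:Q)
  for (t in 1:TT)
    X[i, t] ~ normal(X_true[i, t], s_t[t]);   // tiny s_t[t] on ordinary dates, large on the suspect dates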

But it sounds like you’re saying that this kind of model is inherently challenging for Stan, since for these dates the sampler has trouble moving around…maybe this is not possible?

What troubles you about the prior on X_true? The idea is that returns tend to have mean zero, very mild skew and tend to be almost stationary. That is, they are not terribly far away from Gaussian.

I forgot to address your comment about there being two likelihoods. I made an error; I’ve deleted the extraneous likelihood.

Now there’s no connection between X and X_true, so that doesn’t seem quite right.

Hmm, I’m kind of scared that you won’t have enough data to get very far with this. I guess s_t could be modeled hierarchically, but sparse things aren’t easy.

Well, it’s just you have, for each bit of data X, a parameter to infer, X_true. We don’t know the noise level on X, either. I’m thinking of X as covariates, just cause of the variable name. Given it’s one parameter per value of X, seems like the prior could bias things quite a bit. It won’t bias things if there are terms like X[i, t] ~ normal(X_true[i, t], .0000000001), cause then X_true will basically be X, but if the standard deviation is closer to one, then the prior will pull X_true down to zero.
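To put a rough number on that pull: with the prior X_true ~ normal(0, 1) and a single measurement X ~ normal(X_true, s), the posterior mean of X_true is X / (1 + s^2). With s = .0000000001 that is essentially X, but with s = 1 it is X / 2, so the prior drags the estimate halfway back toward zero.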

If this problem was set up like a normal regression, is it that you’d write:

mkt_true[t] ~ normal(some_function(beta * X[t]), noise[t])

Or something like that?

And the issue is that occasionally this doesn’t really hold? Would it make sense to model mkt_true as a mixture? So like at some times this model holds, and then at other times craziness happens?

Sorry, you’re right, my deletion was thoughtless. I’ve restored it, despite the issue.

I think you hit on what I am trying to achieve when you said that this might be modeled as a mixture. That is exactly right. Most of the time we want ordinary regression, but for something like 15% of the data I want to account for the measurement error unique to this time period. One could simply throw out the data entirely and do ordinary linear regression, but I want to make use of the data since the crisis has interesting features I don’t want to entirely ignore.
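To make that concrete, here is a rough sketch of the mixture I have in mind, written against the observed mkt for simplicity. The names theta and sigma_crisis are just placeholders for this sketch, and some constraint (e.g. sigma_crisis > sigma) would probably be needed to keep the two components identified:

parameters {
  ...
  real<lower=0, upper=1> theta;   // probability that an observation comes from the high-noise "crisis" regime
  real<lower=0> sigma_crisis;     // wider noise scale for the crisis regime
}
model {
  ...
  for (i in 1:Q)
    for (t in 1:TT)
      target += log_mix(theta,
                        normal_lpdf(X[i, t] | beta[i] * mkt[t], sigma_crisis),
                        normal_lpdf(X[i, t] | beta[i] * mkt[t], sigma));
}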

Perhaps the way out then is to do ordinary regression so that

for(t in 1:T) {
    mkt_true[t] ~ normal(mkt_true, sigma);
}


for(t in 1:T) {
  if (t not in recession) {
    X[t] ~ normal(beta * mkt[t], sigma);
  }
  else {
    mkt_true = normal(mkt_true[t], sigma);
  }
}

[Note that my notation is crappy, but X is the dependent variable and mkt is the independent variable.]
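Written a little more carefully, with a hypothetical 0/1 indicator recession[t] added to the data block, the non-recession part is just the ordinary regression:

data {
  ...
  int<lower=0, upper=1> recession[TT];   // 1 for the suspect crisis dates, 0 otherwise
}
model {
  ...
  for (i in 1:Q)
    for (t in 1:TT)
      if (recession[t] == 0) {
        X[i, t] ~ normal(beta[i] * mkt[t], sigma);   // ordinary regression outside the crisis
      }
  // crisis dates would get their own measurement-error term instead
}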

I think this is a promising simplification… Do you have an explanation for why Stan struggles with sparsity? I would be interested in exploring this to get a firmer idea of what works and what doesn’t.

Oh, are there other covariates that aren’t included in this model yet?

Yeah, just play around with it like a regression for awhile and see whatcha get. You can add per-time point effects if you want to try to pick up on the craziness.

What’re the mkt_true = normal(mkt_true[t], sigma); terms doing?

Eh, at this point that’s not how I think of these things. If I’ve got a new model that I don’t understand, when the sampler struggles I take it as a stronger indication that there’s a problem to be fixed in the model than that there’s a problem with the sampler.

It can be the sampler’s fault, but in most of my models it ends up being my fault :D.

It’s a standard kind of measurement model where the error scale is shared by all measurements at a given time, X[ , t]. The posterior for X_true[i, t] will then drift around based on s_t[t] and on however X_true is used downstream (usually as a regression predictor), which will also inform s_t[t]. If the error scales are large, you get a large measurement-error adjustment in the posterior.
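For example, if the error scales were unknown and moved from data to parameters, a minimal sketch (with an arbitrary half-normal prior) would be:

parameters {
  ...
  vector<lower=0>[TT] s_t;   // per-time measurement error scale, shared across the Q series
}
model {
  ...
  s_t ~ normal(0, 0.5);      // half-normal prior, given the lower=0 constraint
  for (t in 1:TT)
    X[, t] ~ normal(X_true[, t], s_t[t]);   // one error scale for all measurements at time t
}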

Yes, mixing scales is hard for most algorithms that rely on step sizes. I think you can non-center here to remove the prior correlation among the unobserved true values and put them on a unit scale:

to_vector(X_true_std) ~ normal(0, 1);   // standard normal over the whole matrix of innovations
for (t in 1:TT)
  X_true[, t] = X[, t] + s_t[t] * X_true_std[, t];
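Slotted into the blocks of the program above, and keeping the regression term from the original model block, that might look roughly like this sketch (to_vector() is only there to put the standard-normal prior on the whole matrix at once):

parameters {
  vector[Q] beta;
  real<lower=0> sigma;
  matrix[Q, TT] X_true_std;   // unit-scale innovations in place of X_true
}
transformed parameters {
  matrix[Q, TT] X_true;
  vector[TT] mkt_true;
  for (t in 1:TT)
    X_true[, t] = X[, t] + s_t[t] * X_true_std[, t];   // true returns = observed returns + scaled noise
  mkt_true = new_mkt(X_true, wts);
}
model {
  beta ~ normal(1, 1);
  sigma ~ normal(0, 0.1);
  to_vector(X_true_std) ~ normal(0, 1);
  for (i in 1:Q)
    for (t in 1:TT)
      X[i, t] ~ normal(beta[i] * mkt_true[t], sigma);   // regression of observed returns on the reconstructed market
}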