# Estimating a posterior for a parameter which is only observed through (a part of a) linear transformation

I’m struggling a bit to translate what I think is a simple idea into a proper Stan model.

Essentially the idea is that I have some number of standard normal prior (unobserved) parameters. I get a single sample of a non-full-rank view of a noise-free linear transformation of those parameters. And then I want to update my priors given that information.

For example, here’s a very simple version: X and Y are two unobserved parameters each of which has a standard normal prior. The only quantity I measure is Z = X + Y. So I go and I take my (single) measurement of Z and now want to generate updated posterior distributions for X and Y in light of this information about Z. For example, if I found that Z = 3, I would expect the posteriors for both X and Y to be centered substantially above their initial 0 means.

That is the simple case. What I have in practice is the more general version:

I have a known matrix T which is square (shape is D x D) and of full rank (det(T) != 0 and/or T is invertible). I believe there is an (unobserved) D-dimensional vector, v, which has standard normal priors on each of its components. I take a single observation Z = R * T * v where T is the matrix above and v is my unobserved parameter vector. Here R is a non-square matrix which effectively tells me what (incomplete) portion of T * V. I.e. R is a matrix with fewer than D rows and exactly D columns in which each row consists of a single 1 and D-1 0's. Given that single observation of Z, my goal is to correctly update the posteriors for the components of v.

What’s the right way to approach this?

If the elements of \mathbf{v} are iid standard normal, then \mathbf{Z} is distributed multivariate normal

However, if \mathbf{R} is unknown but its row vectors are one-hot, which is not amenable to Hamiltonian Monte Carlo algorithms such as the one in Stan. Basically, you would have to marginalize over all possible \mathbf{R} and at that point, you have no unknowns.

Right, I was familiar with those properties. In my case, R is indeed known. It was more a question about how to set this up in Stan.

I believe I’ve gotten something working by introducing a “fake observation noise”. This feels very wrong, intuitively. But it seems to work, practically. The code below implements my “bad” idea. Can you offer a correction and/or a suggestion for how to use the properties of linear transforms of a multi-variate Gaussian to code this up a different way?

I guess the point here is that, in actuality, I have a purely deterministic (but rank reducing) transform from my unobserved parameters to my observed data. I want to use those observations to update my priors. What’s the best way to do that?

data {
int<lower=2> D; // full dimension
int<lower=1, upper=D-1> D_reduced; // dimension of observation
real obs[D_reduced]; // actual observed vector
matrix[D, D] T; // full rank, invertible transform
matrix[D_reduced, D] R; // each row is 1-hot "variable selector"
}

transformed data {
real<lower=0> obs_noise; // this feels wrong!  help!
obs_noise = 0.001  // this feels wrong!  help!
}

parameters {
vector[D] v; // unobserved
}

transformed parameters {
vector[D] t_v;
vector[D_reduced] r_t_v;

t_v = T * v;
r_t_v = R * t_v;
}

model {
v ~ normal(0, 1);
obs ~ normal(r_t_v, obs_noise);  // this feels wrong!  help!
}

If we back up and just consider the case of a linear regression model where the intercept and slope are known, it would be easy to “solve” for the value of the unknown error from the data.

In your case, \mathbf{v} basically plays the role of an error vector. If you know the outcome, know \mathbf{T} and know \mathbf{R}, then you can just “solve” for \mathbf{v} by multiplying the outcome by the inverse of the Cholesky factor of the covariance matrix, which in this case would be tcrossprod(R * T).

But it does not seem like a very satisfying model because of how deterministic it is.

I’m not sure I see it the same way. In this case v has strictly higher dimension than obs. So it is not possible to fully recover v deterministically from obs even though obs is deterministically derived from v.

Here’s an analogy for the use case I have in mind. Imagine you follow 100 stocks in the market. You note down their (%) moves each day for a while and then you run principal component analysis on the moves. That gives you 100 principal component “shapes” that (in decreasing order of usefulness) describe the way the 100 stocks “tend” to move. In fact, just to be slick, let’s write those 100 shapes in a matrix (T) where each column is a principal component and each column is scaled to it’s appropriate size in the historical sample. If you expressed the market moves in units of those principal components, you would find that, through time, you would find that the distribution of each component of that decomposition would be (in PC space) standard normal. Good, great, grand.

Why is this useful? Say I want to understand what market movements (across the full collection of 100 stocks) is “reasonable” given that I know how much, say, 3 of the stocks actually moved. I.e. I want to tell you how much 3 stocks move and then generate “reasonable” joint distributions of the possible moves in all 100 stocks.

In that context, I would appeal to PCA again. Before I tell you the moves in those specific 3 stocks, I would say that the prior on the “amount” of each PC in a market move would be 100 independent standard normals (by construction…that’s the whole point of that particular PC basis). But after observing where those 3 stocks are in my scenario, you should be able to update a new joint distribution on the amounts of all 100 PC’s that could make up that scenario.

So in this context, T is the matrix of scaled principal components. R is a less-than-full rank “sampling” matrix that tells you which stocks you actually “lock” as known. obs is the posited return of the 3 known stocks. Using my code above, I am able to get posterior distributions on all the components of v and as a result can generate posterior distributions for all 100 stocks (T * v). Everything behaves exactly as expected. But it feels like a hack.

This isn’t a perfectly deterministic inversion (because the dimensionality doesn’t allow it). But it’s not intractible either (no problem of infinite numbers of “equally good” solutions) because the priors regularize the problem. The problem at hand is just the generalization of me saying "if I tell you that Z = A + B and that A and B have independent standard normal priors, then if if I subsequently tell you that Z = 1 your new posterior for A and B should become a joint normal with mean 1 for each and perfect anti-correlation.

Does that make sense? If so, what’s the best way to get Stan to do this without my hack?

OK. If there are more elements to \mathbf{v} than there are in the outcome, then instead of finding points deterministically, you might get a line, plane, etc. where the points have to line.

Suppose you somehow partition \mathbf{v} = \begin{bmatrix}\mathbf{v}_A \\ \mathbf{v}_B\end{bmatrix}. Is it the case that given \mathbf{v}_A and the outcome that you could solve for \mathbf{v}_B? If so, then \mathbf{v}_A is your parameter vector that iid standard normal priors and \mathbf{v}_B is a generated quantity.

I don’t believe such a partition is possible because T isn’t just “blocks”. Even in the simplest example where Z = A + B and you just observe Z, you need to have a clearly stated prior for both A and B and you need to update both after observing Z.

Your comment about a line/plane/etc is spot on. The prior starts off by putting a distribution on a D dimensional parameter space. Observing R * T * v (which has less than D dimensions, say D_reduced) tells you that you are now only concerned with the D_reduced dimensional subspace where your data can exist. All other regions of the full D dimensional space are now “ruled out”. So you want to renormalize your prior density: the new density is 0 everywhere outside of the new subspace and is grossed up inside the subspace the total probability equal to 1.

I was hoping this was a thing that was expressible quite naturally in Stan. But am I off base?

Many-to-fewer transformations are not so easy to deal with in a HMC framework.

What you might be able to do is declare a vector in the parameters block whose size is equal to the difference between D and D_reduced. Then in the transformed parameters block, use the algebraic solver to solve for the complimentary vector of size D_reduced that satisfies your constraints. If you put iid standard normal priors on the former vector, you would induce a distribution over the latter vector.

Any chance you could show a small example of that?

Truly appreciate your time on this.

functions {
vector equations(vector v_pinned, vector v_free, real[] x_r, int[] x_i) {
int D_reduced = x_i;
int D = x_i;
matrix[D_reduced, D] RT;
int pos = D_reduced + 1;
for (j in 1:D) for (i in 1:D_reduced) {
RT[i,j] = x_r[pos];
pos += 1;
}
return obs - RT * append_row(v_free, v_pinned);
}
}
data {
int<lower=2> D; // full dimension
int<lower=1, upper=D-1> D_reduced; // dimension of observation
vector[D_reduced] obs; // actual observed vector
matrix[D, D] T; // full rank, invertible transform
matrix[D_reduced, D] R; // each row is 1-hot "variable selector"
}

transformed data {
matrix[D_reduced, D] RT = R * T;
vector[D - D_reduced] zeros = rep_vector(0, D - D_reduced);
int x_i = {D_reduced, D};
real x_r[D_reduced + D_reduced * D];
int pos = D_reduced + 1;
for (i in 1:rows(obs)) x_r[i] = obs[i];
for (j in 1:cols(RT)) for (i in 1:rows(RT)) {
x_r[pos] = RT[i,j];
pos += 1;
}
}

parameters {
vector[D_reduced] v_free;
}
transformed parameters {
vector[D] v_pinned = algebra_solver(equations, zeros, v_free, x_r, x_i);
}
model {
v_free ~ normal(0, 1);
v_pinned ~ normal(0, 1); // Jacobian is constant
}