Extracting correlation from bivariate normal distribution

Hi

I’m new to Stan (and pyStan) and am having trouble with a simple example.

My data is two N(0,1) variables x and y, with correlation rho. For each variable I have N observations. I want to use the following model to infer the correlation. In the model I’ve defined a transformed parameter z, which is N(0,1) and uncorrelated to x.

import numpy as np
import pandas as pd
import pystan as pys
import scipy as sp

model_code = """
data {
    int<lower=0> N; // length of data set
    vector[N] x;    // data, N-length vector
    vector[N] y;    // data, N-length vector
}
parameters {
    // correlation parameter, this is to be inferred
    real<lower=-1.0,upper = 1.0> rho;
}
transformed parameters {
    // auxiliary variable, should be N(0,1) and uncorrelated with x 
    vector[N] z;
    z = (y - rho * x) / sqrt(1.0 - rho * rho);
}
model {
// prior distribution
    rho ~ uniform(-1.0, 1.0); 
// likelihood
// Is there an issue with using a transformed parameter in the ikelihoods?
    x ~ normal(0.0, 1.0);
    z ~ normal(0.0, 1.0);
}
"""

model = pys.StanModel(model_code=model_code)
"""
model = pys.StanModel(model_code=model_code)

I’m using this model as follows, but it is not converging to the correct value for the correlation rho. For a true rho of 0.6, the model is converging to 0.35 +/- 0.2

N = 1000

true_correl = 0.6
true_covar = np.array([[1, true_correl],[true_correl, 1]])
observations = sp.stats.multivariate_normal.rvs(np.array([0, 0]), true_covar, size=N)

data_df = pd.DataFrame(observations, columns=['X_obs', 'Y_obs'])
model_dat = {
    'N': N, 
    'x': data_df['X_obs'].values,
    'y': data_df['Y_obs'].values,
}
fit = model.sampling(data=model_dat, iter=5000, warmup=2500, chains=2)

mean_predicted_rho = np.average(fit.extract()['rho']) # gives 0.35 +/- 0.02
print mean_predicted_rho

I know I could to this by using a multi_normal instead, but this is a toy model that I’d like to extend with this structure so I’m keen to get this example working.

I’m not sure where I’ve gone wrong, so any advice would be greatly appreciated.

Thanks

The problem I see here is that z depends also additionally on rho, so the Jacobian is not constant. Setting a prior in the model block for a (linearly) transformed parameter would only be ok when the transformation is constant or depends on data only. See also The QR Decomposition For Regression Models.

Yeah, what @ermeel said about the change-of-variables is correct. But you should be able to deal with a bivariate normal distribution by composing a marginal normal distribution for x with a conditional normal distribution for y given x. See