Reparameterize in a hierarchical model

Hello,

I would like to ask whether the following is a correct way to perform a reparameterization in a hierarchical model.

I want to code a hierarchical model of reinforcement learning as follows (the actual model is more complex, so I am leaving just the relevant parts I want to ask about here). Here w is a weight in (0,1); the betas should be positive, but I am also putting an upper bound of 10 on them because they typically do not go higher than that.

parameters{   
	//hyperparameters
	real mu_beta;
	real mu_w;
	real<lower = 0> sigma_beta;
	real<lower = 0> sigma_w;

	// Subject level raw parameters 
	vector[N] beta_raw;
	vector[N] w_raw;
}

transformed parameters{
	vector<lower = 0, upper = 10>[N] beta;
	vector<lower = 0, upper = 1>[N] w;
	
	for (i in 1:N){
		beta[i] = Phi_approx(mu_beta + sigma_beta * beta_raw[i])* 10;
		w[i] = Phi_approx(mu_w + sigma_w * w_raw[i]);
	}
}

model{
	//Hyper parameters 
	mu_beta ~ normal(0,10);   
	mu_w ~ normal(0,10);
	sigma_beta ~ cauchy(0,2.5);
	sigma_w ~ cauchy(0,2.5);
	
	// Individual parameters (raw)
	beta_raw ~ normal(0,1);
	w_raw ~ normal(0,1);

	// For each subject
	for (i in 1:N){
		real Qsum;
		for (t in 1:numtrials){
			Qsum = beta[i]*w[i]*Q1[i,t] + beta[i]*(1-w[i])*Q2[i,t];
			choice[i,t] ~ categorical_logit(Qsum);
			// code to update Q1 and Q2
		}
	}
}

Now I would like to try a reparameterization such that theta1 = beta*w and theta2 = beta*(1-w). The individual beta and w would be recovered as beta = theta1 + theta2 and w = theta1/(theta1 + theta2). I don’t know if I can recover the group-level mu_beta and mu_w.
The main parameters I am interested in are the individual beta and w, and also the group-level mu_beta and mu_w. I am wondering whether it is possible to reparameterize and still obtain this information.
The following is the code I have come up with, and I would appreciate it if someone could point out whether I am doing anything wrong or whether there are other ways to do it. Thanks!

parameters{   
	//hyperparameters
	real mu_beta;
	real mu_w;
	real mu_theta1;
	real mu_theta2;
	
	real<lower = 0> sigma_beta;
	real<lower = 0> sigma_w;
	real<lower = 0> sigma_theta1;
	real<lower = 0> sigma_theta2;
	
	// Subject level raw parameters 
	vector[N] beta_raw;
	vector[N] w_raw;
	vector[N] theta1_raw;
	vector[N] theta2_raw;

}

transformed parameters{
	vector<lower = 0>[N] theta1;
	vector<lower = 0>[N] theta2;
	vector<lower = 0, upper = 10>[N] beta;
	vector<lower = 0, upper = 1>[N] w;
	
	for (i in 1:N){
		theta1[i] = Phi_approx(mu_theta1 + sigma_theta1 * theta1_raw[i])* 10;
		theta2[i] = Phi_approx(mu_theta2 + sigma_theta2 * theta2_raw[i])* 10;
		beta[i] = theta1[i] + theta2[i];
		w[i] = theta1[i]/(theta1[i] + theta2[i]);
	}

}

model{
	//Hyper parameters 
	mu_theta1 ~ normal(0,10);   
	mu_theta2 ~ normal(0,10);
	sigma_theta1 ~ cauchy(0,2.5);
	sigma_theta2 ~ cauchy(0,2.5);
	
	// Individual parameters (raw)
	theta1_raw ~ normal(0,1);
	theta2_raw ~ normal(0,1);

	
	// For each subject
	for (i in 1:N){
		real Qsum;
		for (t in 1:numtrials){
			Qsum = beta[i]*w[i]*Q1[i,t] + beta[i]*(1-w[i])*Q2[i,t];
			choice[i,t] ~ categorical_logit(Qsum);
			// code to update Q1 and Q2
		}
	}
}
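
One idea I had for reporting group-level values, though I am not sure it is right, is to add a generated quantities block where the group-level beta and w are the values implied by the group-level theta locations (i.e. a "typical" subject whose raw parameters are 0; the group_* names below are not used anywhere else in the model):

generated quantities {
	// group-level beta and w implied by the theta location hyperparameters
	real group_theta1 = Phi_approx(mu_theta1) * 10;
	real group_theta2 = Phi_approx(mu_theta2) * 10;
	real group_beta = group_theta1 + group_theta2;
	real group_w = group_theta1 / (group_theta1 + group_theta2);
}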

What is Phi_approx? If it weren’t there, then from what you’ve said I’d say the second model seems fine. With it there, it seems like the two models are kind of different.

Is there something going wrong in this model that makes you want to do this reparameterization?

It’s a built-in approximation to Phi (the CDF of the unit normal) that uses a scaled logistic function. It’s close and much more efficient.


It’s a built-in approximation to Phi

Tnx, I shoulda checked the manual.

I’m not sure I was thinking clearly when I made my first response either. These are just different models, even if Phi is something super nice. I’ve stared at things a while and I’m having trouble figuring out exactly what the goals are.

For instance, why isn’t w just N simplexes? How important are these priors? Is there a preference for them to be in terms of theta, or beta, or …? Since Phi is a CDF (a nonlinear transform), you should include the Jacobian adjustment for your transformation if you need it.

What led you to this reparameterization? I think I’m getting lost just trying to push the code around. It could be a good way to reparameterize things, but there are probably other options and ways to think about things too.

Thanks a lot for the response.

The goal is to obtain the group level hyperparameters mu_beta and mu_w, and also the single subject beta and w.

I am concerned that in the original formulation there may be non-identifiability between beta and w, since they are multiplied together in beta[i] * w[i] * Q1[i,t] + beta[i] * (1-w[i]) * Q2[i,t]. So I wanted to try this reparameterization, which has been suggested in past research papers that used something similar.

I think the point is to place weak priors on the group hyperparameters (mu_beta, mu_w, ...), and the individual parameters beta and w will be sampled from distributions specified by the group hyperparameters, e.g. N(mu_beta, sigma_beta). Preferably the priors should be placed in terms of beta and w. I suppose if I transform beta from the thetas and sample beta, I would need a Jacobian adjustment?
The code may look confusing because I am also using the non-centered parameterization (the Matt trick), hence the transformations on raw parameters like beta_raw and specifying beta_raw ~ N(0,1).
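
To make my Jacobian question concrete (and I may have this wrong): if theta1 and theta2 are the actual parameters and I put priors directly on beta and w, I think the map (theta1, theta2) -> (beta, w) = (theta1 + theta2, theta1 / (theta1 + theta2)) has absolute Jacobian determinant 1/beta, so the model block would need something like

	for (i in 1:N){
		beta[i] ~ normal(mu_beta, sigma_beta);  // or whatever prior on the beta scale
		w[i] ~ normal(mu_w, sigma_w);           // and on the w scale
		target += -log(beta[i]);                // log |d(beta, w) / d(theta1, theta2)|
	}

instead of the priors on the raw thetas. Is that the right way to think about it?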

I used the Phi transform because I needed to constrain the parameters to (0,1) for w and (0,10) for beta or theta (the upper bound of 10 should probably be a soft constraint using priors, but I need to figure out how to do that). Would you suggest using something else?
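
For the soft constraint, would something like this be what you mean? Drop the hard upper bound on beta and use a prior to keep most of the mass below 10 (the scale of 5 here is just a guess on my part):

parameters {
	vector<lower = 0>[N] beta;  // hard lower bound only
}
model {
	beta ~ normal(0, 5);  // half-normal since beta > 0, so most of the mass is below ~10
}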

You suggested using a simplex for w, so simplex[2] w[N] to specify a simplex for each individual w? Should I specify the group-level hyperparameter to be alpha, with alpha ~ some distribution, and the array of w ~ dirichlet(alpha)?

Please let me know if any clarification is needed and thanks a lot for your help!

I used the Phi transform because I needed to constrain the parameters to (0,1) for w and (0,10)

It’s probably best to start out just using the Stan constraints to do stuff like this, if for no other reason than readability. For instance, to say a parameter is between zero and ten:

real<lower=0.0, upper=10.0> beta;

And it’s clear that this should be a parameter uniform from 0.0 to 10.0. Hopefully the sampler can achieve that as well :D. It’s hard to reason about parameters if they go through too many transforms.

And since w is a 2 component simplex, just start with that:

parameters {
  simplex[2] w[N];
  vector<lower=0.0, upper=10.0>[N] beta;
}

You can use Dirichlet distributions for the prior on w and it’ll probably be easier to think about what’s going on (do you strongly believe you know nothing about your simplex? Do you believe it should be near 50/50? Do you believe it should be near 20/80? – this is what the Dirichlet prior will let you specify).
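
For example (the alpha values here are purely illustrative), the prior could look like:

model {
  for (i in 1:N)
    w[i] ~ dirichlet(rep_vector(2.5, 2));  // weakly favors w near 50/50
}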

As far as identifiability goes, there could be some issues, but get that answer from the sampler. If Stan is giving “maximum treedepth exceeded” warnings (you can also get this specific information from shinystan), or you look at the pairplots of the posterior and find super correlated parameters, then it’ll be worth worrying about identifiability. But if everything just runs and your posteriors look nice and uncorrelated and all the diagnostics pass, you’re good!

And real quick, this looks suspicious to me:

real Qsum;
...
choice[i,t] ~ categorical_logit(Qsum);

I think Qsum needs to have multiple entries for this to make sense. From the docs,

categorical_lpmf(ints y | vector theta)
The log categorical probability mass function with outcome(s) y in 1 : N given N-vector of outcome probabilities theta.

categorical_logit is just the same but the probabilities come from a vector that is passed through softmax. If there’s only one thing passed in, then it’ll always be softmaxed to a single element vector of probability one!
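
Purely as a sketch of what I mean (K choice options and the per-option value arrays Qa and Qb are made-up names, since I don't know how your Q values are actually structured), something like:

  vector[K] Qsum;  // one entry per choice option
  for (t in 1:numtrials) {
    // assuming Qa[i, t] and Qb[i, t] are each a vector[K] of per-option values
    Qsum = beta[i] * (w[i] * Qa[i, t] + (1 - w[i]) * Qb[i, t]);
    choice[i, t] ~ categorical_logit(Qsum);
  }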

Hope that helps!

Thanks a lot for the prompt reply! It is very helpful.

However, I have a question about using Stan constraints. Can I set the constraints on an individual’s parameter if I need to sample it from a distribution specified from hyperpriors? Suppose that I have

parameters{   
	//hyperparameters
	real mu;
	real<lower = 0> sigma;
	
    //individual parameters
    vector<lower = 0, upper = 10>[N] beta;
  }

model{
	//Hyper parameters 
	mu ~ normal(0,10);   
	sigma ~ cauchy(0,2.5);

	//Individual parameters
    beta ~ normal(mu, sigma);
}

I think in a sampling step, if beta falls outside of the constraint defined in the parameter block, I would encounter an error? That’s why I thought I had to use a transformation using inverse logit for this kind of hierarchical model to work out.

Thanks for the pointer about simplexes, I will try it out. If I think it is best for a prior to be near 50/50 (this is equivalent to a weight w = 0.5 ?), do I specify an alpha vector of [50 50] then?

And thanks for pointing out the issue with Qsum; I coded it correctly as a vector in my code but must have made a mistake when I tried to whittle the code down to something simpler.

Nothing wrong with what you wrote. That’s perfectly fine, though the constraints do look a little weird given nothing else :D.

I think in a sampling step, if beta falls outside of the constraint defined in the parameter block

That’s true, but since you’ve defined the lower bound of beta to be zero and the upper to be 10, it won’t happen. The constraints in Stan are guarantees. The tl;dr on this is that with Stan constraints there are two parameter spaces: one is the constrained space we see, and the other is the unconstrained space the sampler works on. Check out the chapter “34. Transformations of Constrained Variables.” It has a good explanation of how this works. I don’t want to say anything too misleading :D.
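
Roughly (and again, check that chapter rather than trusting my paraphrase), for your real<lower=0.0, upper=10.0> beta the built-in constraint is doing something like this behind the scenes:

parameters {
  real beta_unc;  // the unconstrained value the sampler actually moves around
}
transformed parameters {
  real beta = 10 * inv_logit(beta_unc);  // mapped back into (0, 10)
}
model {
  // log Jacobian of the transform; the built-in <lower, upper> bounds add this for you
  target += log(10) + log_inv_logit(beta_unc) + log1m_inv_logit(beta_unc);
}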

But be careful with the terminology here: there’s nothing being sampled :D. What beta ~ normal(mu, sigma); does is just increment the log density of the model (check “5.3. Sampling Statements”; in this case it’s the same as target += normal_lpdf(beta | mu, sigma)). Stan uses this log density to explore the posterior.

If I think it is best for a prior to be near 50/50 (this is equivalent to a weight w = 0.5 ?), do I specify an alpha vector of [50 50] then?

The best way to get a handle on this is just to plot the Dirichlet pdfs for a bunch of different parameters to see what’s going on. An alpha vector of [50, 50] would be really confident that w is near 0.5. However, [2.5, 2.5] would also say w is near 0.5; it’d just be a much weaker prior.
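
If it’s easier than plotting the pdfs, you can also just draw from a few candidate priors and histogram the first component, e.g. with a throwaway program like this (the alpha values are just examples):

generated quantities {
  vector[2] w_weak = dirichlet_rng(rep_vector(2.5, 2));  // weak prior centered on 50/50
  vector[2] w_strong = dirichlet_rng(rep_vector(50, 2)); // much more concentrated around 50/50
}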

Thank you very much for your help! Your suggestions have been very useful to me. :)

It’s almost never a good idea to use a 2-component simplex. At that point, you can just model the first component as a variable constrained to (0, 1), and there’s much less computational overhead. If you constrain a variable to (0, 1), Stan does a logit transform to the unconstrained space; coming back from unconstrained space (where Stan samples), we have the inverse logit plus the Jacobian adjustment (the latter is necessary if you want the distribution to be uniform on (0, 1)).


@wzhong check out what Bob said here.

You should still be able to use a Dirichlet prior on it (if you like).

Just do:

parameters {
  real<lower=0.0, upper=1.0> w_base;
}
transformed parameters {
  vector[2] w = [w_base, 1.0 - w_base]';
}
model {
  w ~ dirichlet(...);
}

You might get a warning about a Jacobian (since you’re using a sampling statement on a transformed parameter), but you should be good to go.

I see. Many thanks!

No reason to. If

simplex[2] theta;
theta ~ dirichlet([a, b]');

then that’s equivalent to:

real<lower=0, upper=1> phi;
phi ~ beta(a, b);
theta[1] = phi;
theta[2] = 1 - phi;

But you usually don’t need to ever define the simplex theta — you can just work with phi. You’ll just wind up converting:

y ~ categorical(theta);  // y = 1 success, y = 2 failure

to

z = 2 - y;  // z = 1 success, z = 0 failure
z ~ bernoulli(phi);

The arithmetic is for the different way we’re thinking of coding success and failure.

I should put an example in the manual.
