Hierarchical hierarchical model: possibly a transformed parameters + Jacobian issue

Hi

I’m trying to fit a Bayesian model to do an analysis similar to Fusaroli et al. (2012): Coming to Terms: Quantifying the Benefits of Linguistic Coordination.
I’ve got data from a similar experiment where dyads complete a forced-choice task individually and together at various stimulus intensity levels, and with various stimuli (different vowels in my case).
In the paper above, they fitted psychometric models for individuals and dyads separately, and compared the estimated slopes to quantify the benefit they got from working together as a dyad compared to working individually.
Finally, they find that collective benefit is predicted by various features of the conversation between the dyad partners.

I’d like to combine all of this in Stan in order to carry the uncertainty all the way through.
So: I’ve got two identical hierarchical (dyad and stimulus) psychometric models, one fitted to the individual data, the other to the dyad data.
Then in the transformed parameters block, I have basically, for each dyad:

collective_benefit = inv_logit(dyad slope) / inv_logit(individual slope);

Finally, I’d like to put a linear model on collective_benefit.

collective_benefit ~ normal(mu, sigma);
mu = alpha + beta * data;
[…]

It always gives me Jacobian warnings, but I can get as far as modelling it like a linear regression with just an intercept.

But as soon as I try to add (z-scaled) predictors, it spits out hundreds of errors and the fit contains no samples.

Rejecting initial value: Error evaluating the log probability at the initial value.

Reading through the mailing list led me to the Change of variables vs transformations section of the reference manual, and as far as I can tell I’m doing a change of variables on collective benefit.
But I’m not sure I’m reading it right, and even then I have no idea how to get to a “log absolute determinant of the Jacobian of the transform” (in over my head here).

I also wondered if there was simply too much overlap in the collective benefit for the different dyads to support putting a model on it, but based on the posterior of the intercept-only model (see dropbox at the bottom), I don’t think that’s a problem.

Help, please?

Model below:

data{
    // number of data points, groups, and vowels
    int<lower=1> N;
    int<lower=1> N_group;
    int<lower=1> N_vowel;

    // binomial data for the two psychometric functions
    int k_ind[N];
    int k_gro[N];
    int n_ind[N];
    int n_gro[N];

    // input data
    int group[N];
    int intensity[N];
    int vowel[N];

    // collective benefit model
    real local_confidence[N_group];
}

parameters{

    // specify var-covar matrices
    vector[2] gv_group[N_group];
    vector[2] gv_vowel[N_vowel];
    vector[2] iv_group[N_group];
    vector[2] iv_vowel[N_vowel];

    corr_matrix[2] gRho_group;
    corr_matrix[2] gRho_vowel;
    corr_matrix[2] iRho_group;
    corr_matrix[2] iRho_vowel;

    // scale parameters for the random effects
    vector<lower=0>[2] gsigma_group;
    vector<lower=0>[2] gsigma_vowel;
    vector<lower=0>[2] isigma_group;
    vector<lower=0>[2] isigma_vowel;

    // main effects
    real ga;
    real gb;
    real ia;
    real ib;

    // =========================================================
    // parameters for the collective benefit model
    real<lower=0> collective_sigma;
    real col_a;
    real col_b_confidence;
}

transformed parameters{
    real<lower=0> collective_benefit[N_group];
    for ( k in 1:N_group ) {
        collective_benefit[k] =
            inv_logit(gv_group[k,2] * gsigma_group[2])    // group slope
            / inv_logit(iv_group[k,2] * isigma_group[2]); // individual slope
    }
}

model{
    // psychometric model for individuals and groups
    // logit(theta) = A + B * stimulus intensity
    vector[N] gtheta;
    vector[N] gA;
    vector[N] gB;

    vector[N] itheta;
    vector[N] iA;
    vector[N] iB;

    real collective_mu[N];

    // priors for the “main effects” in the psychometric models
    ia ~ normal(0, 1);
    ib ~ normal(0, 1);
    ga ~ normal(0, 1);
    gb ~ normal(0, 1);

    // correlations of the random effects
    gv_group ~ multi_normal(rep_vector(0,2), gRho_group);
    gv_vowel ~ multi_normal(rep_vector(0,2), gRho_vowel);
    iv_group ~ multi_normal(rep_vector(0,2), iRho_group);
    iv_vowel ~ multi_normal(rep_vector(0,2), iRho_vowel);

    gRho_group ~ lkj_corr( 4 );
    gRho_vowel ~ lkj_corr( 4 );
    iRho_group ~ lkj_corr( 4 );
    iRho_vowel ~ lkj_corr( 4 );

    // scale of the random effects
    gsigma_group ~ cauchy( 0 , 2 );
    gsigma_vowel ~ cauchy( 0 , 2 );
    isigma_group ~ cauchy( 0 , 2 );
    isigma_vowel ~ cauchy( 0 , 2 );

    // psychometric function for both individuals and groups
    for ( i in 1:N ) {
        gA[i] = ga + gv_vowel[vowel[i],1] * gsigma_vowel[1] + gv_group[group[i],1] * gsigma_group[1];
        gB[i] = gb + gv_vowel[vowel[i],2] * gsigma_vowel[2] + gv_group[group[i],2] * gsigma_group[2];
        gtheta[i] = gA[i] + gB[i] * intensity[i];

        iA[i] = ia + iv_vowel[vowel[i],1] * isigma_vowel[1] + iv_group[group[i],1] * isigma_group[1];
        iB[i] = gb + iv_vowel[vowel[i],2] * isigma_vowel[2] + iv_group[group[i],2] * isigma_group[2];
        itheta[i] = iA[i] + iB[i] * intensity[i];
    }

    k_gro ~ binomial_logit(n_gro, gtheta);
    k_ind ~ binomial_logit(n_ind, itheta);

    // here comes the collective benefit stuff

    for (k in 1:N_group) {
        collective_mu[k] = col_a + col_b_confidence * local_confidence[k];
    }

    col_a ~ normal(1,1);
    col_b_confidence ~ normal(0,1);
    collective_sigma ~ cauchy(0,2);

    collective_benefit ~ normal(collective_mu, collective_sigma);

    // Specifying only an intercept works, but …
    // collective_benefit ~ normal(col_a, collective_sigma);
}

data on dropbox

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
StanHeaders_2.15.0-1
rstan_2.15.1
clang version 4.0.0-1ubuntu1

Thanks in advance

  • Malte

You are trying to build a model with

f(parameter) ~ dist_lpdf(other parameters);

which is causing the warning. In this case the warning is important because you do in fact need the Jacobian determinant, | df / d(params) | to ensure that you have the right probability density function.

To figure out how to do the Jacobian correction you’ll want to brush up on your calculus and the basics of probability. Alternatively, you can avoid the math by building a generative model where there are no transforming functions. This will require some work to redo what you have, but it has the big added benefit of yielding a model that is better posed and likely easier to fit.
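
For reference, the simplest concrete case is the one-dimensional log transform (the log-normal example from the manual’s change-of-variables discussion); the correction is just the log absolute derivative of the transform added to the target:

parameters {
    real<lower=0> x;
}
model {
    // intended distribution: log(x) ~ normal(0, 1), i.e. x is lognormal(0, 1)
    target += normal_lpdf(log(x) | 0, 1);
    // change-of-variables correction: log | d/dx log(x) | = log(1 / x) = -log(x)
    target += -log(x);
}

For a multivariate one-to-one transform, the -log(x) term becomes the log absolute determinant of the Jacobian matrix of the transform.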

Thanks, that at least confirmed what my problem is.
I looked into the Jacobian, but I’m not sure I understand what to (try to) take the Jacobian of.
If I understand it correctly, the Jacobian is the matrix of partial derivatives of n functions with respect to n parameters.
I think I only have one function,
f(a,b) = inv_logit(a) / inv_logit(b)
and two parameters, here a and b (the dyad and individual slopes in the model).
But if I take the partial derivatives of that and put them in a matrix, it obviously doesn’t make the square matrix I’d need for the determinant …

So I’m doing / understanding something wrong, but what?

Okay, I tried a little more and found by luck on the German Wikipedia that (sometimes?) you can get something similar to the Jacobian determinant for a non-square matrix by multiplying it by its transpose first.
https://de.wikipedia.org/wiki/Gramsche_Determinante

Now I have in my model block:


for (k in 1:N_group) {
    a = gv_group[k,2] * gsigma_group[2];
    b = iv_group[k,2] * isigma_group[2];

    // df / da  =  (exp(b)+1) * exp(a-b) / ((exp(a)+1)^2)
    J[1,1] = (exp(b)+1) * exp(a-b) / ((exp(a)+1)^2);
    // df / db  =  -(exp(a-b)/(exp(a)+1))
    J[1,2] = -(exp(a-b)/(exp(a)+1));
    target += log(fabs(determinant(tcrossprod(J))));
}

But I’m still getting the same Error evaluating the log probability errors, so that’s probably still wrong …

Edit: fixed df / db

collective_benefit[k] =
inv_logit(gv_group[k,2] * gsigma_group[2]) // group slope
/ inv_logit(iv_group[k,2] * isigma_group[2]); // individual slope
collective_benefit ~ normal(collective_mu, collective_sigma);

What makes you sure that collective_benefit is normally distributed?

Andre

Andre:
What makes you sure that collective_benefit is normally distributed?

It’s the default assumption for linear regressions, and one that I don’t hold on to too strongly.
In fact, because it’s a ratio it’s constrained to be positive, so there is likely a better distribution. Any ideas? Log-normal?

Maltelau:
I wonder why you do the transform this way and then run into Jacobian adjustments to solve it.
Your transformation is the first of its kind I’ve seen coded this way. It could be coded as a half-normal.
In that case you would have to constrain it with lower=0. Instead of:

collective_benefit ~ normal(collective_mu, collective_sigma);
you could write:
(collective_benefit - collective_mu) / collective_sigma ~ normal(0, 1);

It doesn’t look like what I guess your intention is.

Andre

The problem with rejecting initial values is that you need to declare your constrained parameters with appropriate constraints. Every value of the parameters that satisfies the declared constraints should have a positive density (finite log density). For scale parameters, you need to include <lower = 0> in the declarations as they have to be positive. Similarly, gtheta needs to be declared with <lower=0, upper=1> because it’s a probability.

Everything will be much more efficient if you use the Cholesky factor parameterization of the correlation matrix (see the regression chapter of the manual). You can also vectorize this

gA[i] = ga + gv_vowel[vowel[i],1] * gsigma_vowel[1] + gv_group[group[i],1] * gsigma_group[1]; 

as

gA = ga + gv_vowel[vowel, 1] * gsigma_vowel[1] + gv_group[group, 1] * gsigma_group[1];

and similarly for the other lines.
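
For one of the four random-effect blocks, a minimal self-contained sketch of the Cholesky-factor version might look like this (keeping your variable names; the other three blocks are analogous):

data {
    int<lower=1> N_group;
}
parameters {
    // Cholesky factor of the 2x2 correlation matrix, replacing corr_matrix[2] gRho_group
    cholesky_factor_corr[2] gL_group;
    vector[2] gv_group[N_group];
}
model {
    gL_group ~ lkj_corr_cholesky(4);
    // same distribution as gv_group ~ multi_normal(rep_vector(0, 2), gRho_group),
    // but more efficient and numerically stabler
    gv_group ~ multi_normal_cholesky(rep_vector(0, 2), gL_group);
}
generated quantities {
    // recover the correlation matrix for the output if you want it
    corr_matrix[2] gRho_group;
    gRho_group = multiply_lower_tri_self_transpose(gL_group);
}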

For the Jacobian, you need something that’s going to produce a normalizable distribution (it doesn’t have to be normalized, just normalizable). There’s a discussion in the manual of the problems you run into if you do something like:

real a;
real b;
...
(a + b) ~ normal(0, 1);

It doesn’t produce a proper posterior, because the density is constant along directions that leave a + b unchanged and so can’t be normalized. You’re going to run into the same problem with other functions f(a, b) if they don’t wind up putting proper distributions on things.

But as Michael says, I would highly recommend trying to reformulate the model generatively. I don’t understand enough about the model to see why you’re trying to put a distribution on the ratio a/b; it’ll run into the same problems whenever multiple pairs (a, b) produce the same output (e.g., (3,3) and (-1,-1)).

P.S. I read the abstract for the Fusaroli et al. paper you cited and I have to say that I love this topic. I used to work on semantics and teach both psycholinguistics and philosophy of language back when I was a professor and most recently worked on an empirical paper on word meaning. I always thought all the formal semantics missed the point of word meaning, and only reconciled myself with my own field when I realized it was really just syntax, and not really semantics. All the formal semantics theories were just naive Platonism in my opinion (sure, they could wrap it up in possible worlds a la Lewis and Stalnaker, but it was missing the critical cognitive and communicative aspect of language). This is mainly why I got out of linguistics—I think it took the wrong turn with Aspects when Chomsky introduced the competence/performance dichotomy. It also reminded me of Peter Ludlow’s book on the dynamic lexicon to which I was introduced at a neat psycholinguistics/semantics/pragmatics conference that Matthew Stone organized at Rutgers a few years ago.

You can see where the model is non-generative from the plate diagram—you have it as both a deterministic node (defined in terms of the dyad inverse logit things) and as a stochastic node sampled from a distribution.

The whole purpose of dividing the two slopes is just to get a comparison that I can use in the final part of the model. And since they divided in the Fusaroli et al. paper, I think I can use their results as empirical priors (with the caveat that it’s a different sensory modality) if I do the same. I’d also be fine with getting a - b or something similar working instead of a / b, but that hasn’t worked so far either.

I tried turning it into a generative model by putting priors on the individual slope and the collective benefit, and putting the group slope on the left-hand side of the transformation statement.
I vectorized what I could and looked at the bounds (I don’t think gtheta and itheta need bounds since I’m using binomial_logit; it’s just bad naming on my part).

I tried the Cholesky factor parametrization but couldn’t get it to sample even without a linear model on collective_benefit, so I’ve put it away for now and will return to it once this issue is solved. Premature optimization and all that (or does that not carry over to Stan??). For my own sanity if nothing else, you’ll see I also rolled back the non-centered random effects for now to have a simpler generative model.

But I’m still getting the same error.

data{
    // number of data points, groups, and vowels
    int<lower=1> N;
    int<lower=1> N_group;
    int<lower=1> N_vowel;
    
    // binomial data for the two psychometric functions
    int<lower=0> k_ind[N];
    int<lower=0> k_gro[N];
    int<lower=0> n_ind[N];
    int<lower=0> n_gro[N];
    
    // input data
    int group[N];
    int vowel[N];
    vector[N] intensity;
    
    // collective benefit model
    vector[N_group] local_confidence;
    
}

parameters{
    
    // random effects
    vector[N_group] ia_group;
    vector[N_group] ib_group;
    vector[N_group] ga_group;
    
    vector[N_group] ia_vowel;
    vector[N_group] ib_vowel;
    vector[N_group] ga_vowel;
    vector[N_group] gb_vowel;
    
    
    // main effects
    real ga;
    real gb;
    real ia;
    real ib;
    
    // =========================================================
    // parameters for the collective benefit model
    real<lower=0> collective_sigma;
    real col_a;
    real col_b_confidence;
    real collective_benefit[N_group];
    
    
}

transformed parameters{
    vector[N_group] gb_group;
    for (k in 1:N_group) {
            gb_group[k] = logit(collective_benefit[k] / inv_logit(ib_group[k]));
    }
}


model{
    vector[N] gtheta;
    vector[N] gA;
    vector[N] gB;
    
    vector[N] itheta;
    vector[N] iA;
    vector[N] iB;
    
    vector[N] collective_mu;
    
    ia_group ~ normal(0,1);
    ga_group ~ normal(0,1);
    ib_group ~ normal(0,1);
    
    ia_vowel ~ normal(0,1);
    ga_vowel ~ normal(0,1);
    ia_vowel ~ normal(0,1);
    ib_vowel ~ normal(0,1);
    
    ia ~ normal(0, 1);
    ib ~ normal(0, 1);
    ga ~ normal(0, 1);
    gb ~ normal(0, 1);
    
    
    // psychometric function for both individuals and groups
    gA = ga + ga_vowel[vowel] + ga_group[group]; 
    gB = gb + ga_vowel[vowel] + gb_group[group];
    
    iA = ia + ia_vowel[vowel] + ia_group[group]; 
    iB = gb + ia_vowel[vowel]  + ib_group[group];
    
    for (i in 1:N){
        // theta is not actually a rate parameter since I'm using binomial_logit
        gtheta[i] =  gA[i] + gB[i] * intensity[i];
        itheta[i] =  iA[i] + iB[i] * intensity[i];
    }
        
    k_gro ~ binomial_logit(n_gro, gtheta);
    k_ind ~ binomial_logit(n_ind, itheta);
    
    
    
    // here comes the collective benefit stuff
    
    collective_benefit ~ normal(collective_mu, collective_sigma);
    collective_mu = col_a + col_b_confidence * local_confidence;
    
    col_a ~ normal(1,1);
    col_b_confidence ~ normal(0,1);
    collective_sigma ~ cauchy(0,2);
    
}


I couldn’t follow how you transformed the model (not because it looks wrong, but just because the model’s very complicated and I don’t have a lot of time to spend debugging user models).

What error are you getting now?

I think it might help to pull back to a simpler model and build up to this more complex model if there’s any way to make a path from simple to where you want to go. It’s much easier to debug that way and you make up for the extra model building time in the end.

YES, I finally managed to make it work.

Bob wrote:
I think it might help to pull back to a simpler model and build up to this more complex model

I’d like to be honest and say this kinda annoyed me at first since that’s what I’d been trying to do all along. But hey, it’s good advice and I eventually followed it.

The problematic part of this model is the transformation that links the two binomial models with the collective benefit linear model. And I’ve obviously made tons of errors trying to specify that in a way that makes the model generative.

But I did find out that premature optimization is still bad when working with Stan. I had vectorized the likelihood statement collective_benefit ~ normal(mu, sigma), but that kept failing. When I rolled back that optimization it finally started sampling and I had something to work with…

Next I’d like to do model comparison (collective_benefit ~ 1 vs. various predictors) with WAIC, but it doesn’t seem straightforward since the only observed nodes I have are the k’s and n’s for the psychometric curves.
Can I just compute log likelihoods from the two binomial models and add them, or something?
Edit: Or treat them separately and have log_lik[N*2] values? A rough sketch of what I have in mind is below, after the model.

data{
    // number of data points, groups, and vowels
    int<lower=1> N;
    int<lower=1> N_group;
    int<lower=1> N_vowel;
    
    // binomial data for the two psychometric functions
    int<lower=0> k_ind[N];
    int<lower=0> k_gro[N];
    int<lower=0> n_ind[N];
    int<lower=0> n_gro[N];
    
    // input data
    int group[N];
    int vowel[N];
    vector[N] intensity;
    
    // collective benefit model
    vector[N_group] local_confidence;
    
}

parameters{
    
    // random effects
    vector[N_group] ia_group;
    vector[N_group] ib_group;
    vector[N_group] ga_group;
    
    vector[N_vowel] ia_vowel;
    vector[N_vowel] ib_vowel;
    vector[N_vowel] ga_vowel;
    vector[N_vowel] gb_vowel;
    
    // main effects
    real ga;
    real gb;
    real ia;
    real ib;
    
    // =========================================================
    // parameters for the collective benefit model
    real<lower=0> collective_benefit[N_group];
    real<lower=0> collective_sigma;
    real col_a;
    real col_b_confidence;
    
}

transformed parameters{
    vector[N_group] gb_group;
    vector<lower=0,upper=1>[N] gb_group_theta;
            
    for (i in 1:N) {
        // collective benefit = inv_logit(dyad slope) / inv_logit(individual slope)
        // isolate for the group effect of the dyad slope
        gb_group_theta[i] = inv_logit(ib + ib_vowel[vowel[i]] + ib_group[group[i]]) * collective_benefit[group[i]];
        gb_group[group[i]] = logit(gb_group_theta[i]) - (gb + gb_vowel[vowel[i]]);
    }
}


model{
    // intermediate parameters for the psychometric curves
    vector[N] gtheta;
    vector[N] gA;
    vector[N] gB;
    
    vector[N] itheta;
    vector[N] iA;
    vector[N] iB;
    
    // intermediate parameter for the collective benefit model
    vector[N] collective_mu;
    
    // priors
    ia_group ~ normal(0,1);
    ib_group ~ normal(0,1);
    ga_group ~ normal(0,1);
    
    ia_vowel ~ normal(0,1);
    ib_vowel ~ normal(0,1);
    ga_vowel ~ normal(0,1);
    gb_vowel ~ normal(0,1);
    
    ia ~ normal(0, 1);
    ib ~ normal(0, 1);
    ga ~ normal(0, 1);
    gb ~ normal(0, 1);
    
    // psychometric function for both individuals and groups
    gA = ga + ga_vowel[vowel] + ga_group[group]; 
    gB = gb + gb_vowel[vowel] + gb_group[group];
    
    iA = ia + ia_vowel[vowel] + ia_group[group]; 
    iB = ib + ib_vowel[vowel] + ib_group[group];
    
    for (i in 1:N){
        // theta is not actually a rate parameter since I'm using binomial_logit
        gtheta[i] =  gA[i] + gB[i] * intensity[i];
        itheta[i] =  iA[i] + iB[i] * intensity[i];
    }
        
    k_gro ~ binomial_logit(n_gro, gtheta);
    k_ind ~ binomial_logit(n_ind, itheta);
    
    // here comes the collective benefit stuff
    
    // This gives "error evaluating log probability"
    // collective_mu = col_a + col_b_confidence * local_confidence;
    // collective_benefit ~ normal(collective_mu, collective_sigma);
    
    // But this works!!
    for (k in 1:N_group) {
        collective_mu[k] = col_a + col_b_confidence * local_confidence[k];
        collective_benefit[k] ~ normal(collective_mu[k], collective_sigma);
    }
    
    col_a ~ normal(1,1);
    col_b_confidence ~ normal(0,1);
    collective_sigma ~ cauchy(0,2);
    
}
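
Here is the rough sketch I mean for the WAIC question: a generated quantities block that recomputes the linear predictors and stacks the group and individual pointwise log-likelihoods into a single log_lik vector of length 2 * N (I’m not at all sure this is the right way to define the “observations”):

generated quantities {
    // pointwise log-likelihoods for WAIC / loo:
    // entries 1:N are the group observations, entries N+1:2N the individual ones
    vector[2 * N] log_lik;
    for (i in 1:N) {
        real gtheta_i;
        real itheta_i;
        gtheta_i = (ga + ga_vowel[vowel[i]] + ga_group[group[i]])
                   + (gb + gb_vowel[vowel[i]] + gb_group[group[i]]) * intensity[i];
        itheta_i = (ia + ia_vowel[vowel[i]] + ia_group[group[i]])
                   + (ib + ib_vowel[vowel[i]] + ib_group[group[i]]) * intensity[i];
        log_lik[i] = binomial_logit_lpmf(k_gro[i] | n_gro[i], gtheta_i);
        log_lik[N + i] = binomial_logit_lpmf(k_ind[i] | n_ind[i], itheta_i);
    }
}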

PS: the posterior predictive looks good … if only I had some more data :)


In the original model, you had a function

f(a,b) ~ normal(m,s)

Obviously there’s no Jacobian involved; as you say, there is no square matrix, just a function that takes a two-dimensional volume and collapses it onto a one-dimensional space.

The right way to think about this statement is as a weighting function that multiplies the priors on a and b, so your real prior is p(a,b)*f(a,b), where p(a,b) is whatever you declared explicitly as priors on a and b.

In such cases there is no Jacobian to be calculated. The f(a,b) ~ normal(m,s) statement is just downweighting those values of a and b that lead to f values in the tail of the normal distribution.
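
Concretely, in Stan that reading looks something like this sketch (with made-up values standing in for m and s; Stan will still print the Jacobian warning for the last line, but under this interpretation there is nothing to correct):

parameters {
    real a;   // stands in for the dyad slope term
    real b;   // stands in for the individual slope term
}
model {
    // the explicit prior p(a, b)
    a ~ normal(0, 1);
    b ~ normal(0, 1);
    // the weighting statement: no Jacobian, it just downweights (a, b) pairs whose
    // ratio of inverse logits lands in the tails of normal(1, 0.5)
    inv_logit(a) / inv_logit(b) ~ normal(1, 0.5);
}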


We’re trying hard not to be annoying. For what it’s worth, we’re always giving each other and ourselves the advice to start a bit simpler. That’s where Michael’s recent mixture model case study came from, for example.

You may want to start a second thread on computing WAIC here. Is there a way to think about the problem as a kind of cross-validation problem (predict some observations based on others)?

My statement above was a bit sloppy. The correct expression for the prior is
p(a,b) * Q(f(a,b)-m,s) / Z

where Q is the normal density and Z is the normalization constant (thankfully not needed for MCMC).

Q acts as a weighting function multiplying the simpler expression p(a,b) to modify the prior into something that only puts prior density in the regions where f(a,b) is near m. Because of this, things become de-normalized, so mathematically you need to renormalize with Z; but of course for MCMC you’re fine with un-normalized densities.

Stan gives the warning about Jacobians because it can’t distinguish between the case where you’re transforming variables and need a Jacobian, and the case where you’re applying a weighting function that has no Jacobian.

The weighting function approach is generative: simply generate according to p(a,b), then calculate f(a,b) and accept with probability proportional to Q(f(a,b)-m,s), otherwise re-generate. This emphasizes that what you’re doing is creating a slightly more complex form of prior.

I think I’m after a weighting function as well. Is that perhaps what Bob Carpenter is describing here: Simple model Parse Errors - #18 by Bob_Carpenter

Any chance you could suggest some literature or Stan manual entries for this?

It looks like it. I can’t really recommend any literature or examples, but if you think carefully about the density you want, you should be able to express it in terms of such “pseudo-sampling statements” like f(a,b) ~ normal(m,s). They are totally legitimate ways to express a density as a product of several pieces: every ~ adds a separate term to the log density, which is the same as saying it multiplies the density. So if you want a model like

p(data | parameters) * W1(f(a,b) | parameters) * W2(parameters)

where W1 * W2 = p(parameters), then just go for it.