Multiple outcomes hurdle model - exists?

I have some data which includes multiple highly correlated continuous outcomes. I had planned to analyse it using a multivariate normal model similar to the seemingly unrelated regression example in section 21.5 of the manual. However, on inspecting the data, I see that many of the outcomes suffer from a limit-of-detection effect of the kind you might handle with a hurdle model if you were looking at each of them individually.


This makes me wonder whether the hurdle model can be, or has been, extended to the multivariate case. Searches have shed little light, so I'm wondering if the Stan meta-brain has any relevant wisdom?

Just thinking out loud here:

What about combining the SUR with the multivariate (latent variable) probit approach from the manual? Then you'd have an 18-dimensional normal over the latent "utilities" of the probits and the log(?)-normal outcomes. This would be pretty messy to implement, because you'd have to use the techniques from the missing data chapter as well.


Thanks for the suggestion @Max_Mantei. I took a look at that probit model and it doesn't seem quite right for what I want. A univariate hurdle model for continuous variables (as far as I understand) has both a Bernoulli component and either a gamma or a lognormal distribution. From testing univariate models of the above, I think a lognormal is more appropriate. However, I don't know if there is such a thing as a multivariate lognormal distribution in Stan? If I understand correctly, a Wishart distribution is a multivariate gamma, so perhaps as a first step I could adjust the SUR example to use a Wishart distribution and try to figure out the Bernoulli component later?

One issue with this is that my MVN model (below) uses the Cholesky decomposition for speed. Is there a Wishart equivalent of this?

data{
    int<lower=0> NvarsX;   // num independent variables
    int<lower=0> NvarsY;   // num dependent variables
    int<lower=0> N;        // Total number of rows
    vector[NvarsX] x [N];  // data for independent vars
    vector[NvarsY] y [N];  // data for dependent vars
}

parameters{
    cholesky_factor_corr[NvarsY] L_Omega; // L form of correlation matrix for Yvars
    vector<lower=0>[NvarsY] L_sigma;      // Vector of scales for covariance matrix for Yvars

    vector[NvarsY] alpha;         // Intercepts for each Y
    matrix[NvarsY, NvarsX] beta ; // Coefficients for each X vs each Y
}

transformed parameters{
    vector[NvarsY] mu[N];
    matrix[NvarsY, NvarsY] L_Sigma;

    L_Sigma = diag_pre_multiply(L_sigma, L_Omega);    // L form of Sigma covariance matrix for Yvars

    for (n in 1:N){
            mu[n] = alpha + beta * x[n];
    }
}

model{
    //Priors
    L_Omega ~ lkj_corr_cholesky(1);
    L_sigma ~ normal(0, 5);
    
    alpha ~ normal(0, 3);
    to_vector(beta) ~ normal(0, 2);

    //likelihood
    y ~ multi_normal_cholesky(mu, L_Sigma);
}

generated quantities{
    matrix[NvarsY, NvarsY] Omega;   // Square form of correlation matrix
    matrix[NvarsY, NvarsY] Sigma;   // Square form of covariance matrix
        
    // Generate correlation matrix
    Omega = multiply_lower_tri_self_transpose(L_Omega);     
    Sigma = quad_form_diag(Omega, L_sigma);               
}

Hello!

Using factor-style latent variables to capture the covariance structure might be a solution!

The following blog post presents a very simple case, but you can include observed predictors alongside the latent ones!

EDIT : OMG, forgot the link!

https://rive-numeri-lab.github.io/2018/09/04/GBFM.html

Lucas


Nice reference from Lucas, I'll have to check this more thoroughly later.

My idea was more in the direction of working with the latent variable approach to the probit (where the latent variable is normally distributed) and taking the log of the outcomes, assuming it to be normally distributed as well (implying a log-normal). So for one of your outcome variables you can formulate:

\begin{bmatrix} z_i \\ \log y_i \end{bmatrix} \sim \text{Multi-Normal} \left( \begin{bmatrix} \eta_i \\ \mu_i \end{bmatrix}, \begin{bmatrix} 1 & \rho\sigma \\ \rho\sigma & \sigma^2 \end{bmatrix} \right)

where y_i = 0 whenever z_i > 0, so that P(y_i=0)=\Phi(\eta_i), with \Phi the cdf of the standard normal distribution, implying a probit regression (check the latent variable approach to the probit in the manual!), and you treat the cases where y_i=0 \rightarrow \log y_i = -\infty as missing data when setting up the log_y vector in Stan (see the missing values chapter in the manual). Since there is a change of variables (from y to \log y) you need to apply the Jacobian correction (this case of a log-normal is actually covered in the manual as well): target += -log_y;. The correlation \rho captures the fact that you probably expect higher values of \log y whenever the probability of a zero value is low (implying a negative correlation, which is sort of a nice add-on to a conventional hurdle model, I guess… you can set this to zero if you don't want it).
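Just to make this a bit more concrete, here is a rough (and untested!) Stan sketch of that single-outcome case. All names are placeholders, the zero/non-zero index arrays are assumed to be prepared beforehand in R, and instead of calling multi_normal directly I factor the bivariate normal into log y and z given log y, so that the latent z's can be declared with simple sign constraints (as in the manual's latent variable probit):

data{
    int<lower=1> N;
    int<lower=1> K;                        // number of predictors
    matrix[N, K] X;
    int<lower=0> N_pos;                    // number of observations with y > 0
    int<lower=0> N_zero;                   // number of observations with y == 0
    int<lower=1, upper=N> ii_pos[N_pos];   // indices of the y > 0 observations
    int<lower=1, upper=N> ii_zero[N_zero]; // indices of the y == 0 observations
    vector<lower=0>[N_pos] y_pos;          // the positive y values
}

parameters{
    real alpha_z;                          // probit intercept
    vector[K] beta_z;                      // probit coefficients
    real alpha_y;                          // lognormal intercept
    vector[K] beta_y;                      // lognormal coefficients
    real<lower=0> sigma;                   // scale of log y
    real<lower=-1, upper=1> rho;           // correlation between z and log y
    vector<lower=0>[N_zero] z_zero;        // latent z where y == 0 (y == 0 <=> z > 0)
    vector<upper=0>[N_pos] z_pos;          // latent z where y > 0
}

model{
    vector[N] eta = alpha_z + X * beta_z;  // probit linear predictor
    vector[N] mu = alpha_y + X * beta_y;   // lognormal linear predictor
    vector[N_pos] log_y = log(y_pos);

    alpha_z ~ normal(0, 3);
    alpha_y ~ normal(0, 3);
    beta_z ~ normal(0, 2);
    beta_y ~ normal(0, 2);
    sigma ~ normal(0, 5);

    // y == 0: log y is "missing", so only the marginal of z contributes
    z_zero ~ normal(eta[ii_zero], 1);

    // y > 0: bivariate normal of (z, log y), factored as
    // log y ~ N(mu, sigma) and z | log y ~ N(eta + (rho/sigma)(log y - mu), sqrt(1 - rho^2))
    log_y ~ normal(mu[ii_pos], sigma);
    target += -sum(log_y);                 // Jacobian of the y -> log y change of variables
    z_pos ~ normal(eta[ii_pos] + rho / sigma * (log_y - mu[ii_pos]), sqrt(1 - rho^2));
}

The factored form is just another way of writing the bivariate normal density above, but it keeps the sign constraints on the z's simple and avoids building a 2 by 2 covariance matrix per observation.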

Now, my actual idea for two of your outcome variables:

\begin{bmatrix} z_{1,i} \\ \log y_{1,i} \\ z_{2,i} \\ \log y_{2,i} \end{bmatrix} \sim \text{Multi-Normal} \left( \begin{bmatrix} \eta_{1,i} \\ \mu_{1,i} \\ \eta_{2,i} \\ \mu_{2,i} \end{bmatrix}, \begin{bmatrix} 1 & \rho_1\sigma_1 & \omega_{1,1} & \omega_{1,2}\sigma_2 \\ \rho_1\sigma_1 & \sigma_1^2 & \omega_{2,1}\sigma_1 & \rho_{1,2}\sigma_1\sigma_2 \\ \omega_{1,1} & \omega_{2,1}\sigma_1 & 1 & \rho_2\sigma_2 \\ \omega_{1,2}\sigma_2 & \rho_{1,2}\sigma_1\sigma_2 & \rho_2\sigma_2 & \sigma_2^2 \end{bmatrix} \right)

where the \rho's capture the correlations "within" the hurdle models (as above) and the \omega's capture the SUR correlations - but somewhat extended to the hurdle parts. Again, you can set to zero whatever you don't like - but then you can't really use the neat LKJ prior…

Edit: oh, I forgot… the Jacobian correction for this multivariate change of variables is super easy, if I'm not mistaken: it should be target += -log_y1 - log_y2; but you might want to re-check.
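One more implementation thought (again untested, and all names are placeholders): you can keep the neat LKJ prior by putting a 4x4 correlation matrix on (z_1, \log y_1, z_2, \log y_2) and fixing the scales of the two latent z's to 1, e.g. via diag_pre_multiply:

parameters{
    cholesky_factor_corr[4] L_Omega;   // correlations among (z_1, log y_1, z_2, log y_2)
    vector<lower=0>[2] sigma;          // scales of log y_1 and log y_2
}

transformed parameters{
    vector[4] scales;
    matrix[4, 4] L_Sigma;
    scales[1] = 1;                     // probit scales fixed to 1 for identification
    scales[2] = sigma[1];
    scales[3] = 1;
    scales[4] = sigma[2];
    L_Sigma = diag_pre_multiply(scales, L_Omega);
}

model{
    L_Omega ~ lkj_corr_cholesky(2);
    sigma ~ normal(0, 5);
    // ... plus the latent z's, the observed log y's and the Jacobian terms from above
}

Then L_Sigma * L_Sigma' has exactly the structure of the covariance matrix above, with the 1's for the latent parts coming for free.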


Thanks to both Lucas and Max. All of that is a bit over my head right now, I'll have to take some time to digest it. Will come back when I've had the chance. Appreciate your postings.

Hi Max, I partially understand your post now. But one thing I'm not clear on is the bit about Jacobian corrections. I did see the discussion in the manual about the change of variables for the log-normal, and not understanding it is one of the reasons I started this thread rather than just having a go myself blindly 😅

To help me relate to this - let's forget the latent part for a moment. Suppose I simply want a multivariate lognormal, and so want to change this line from my code above:

    y ~ multi_normal_cholesky(mu, L_Sigma);

to a log normal. My first instinct would be to do this:

log(y) ~ multi_normal_cholesky(mu, L_Sigma);
target += -log(y);

but reading the section on "Multivariate change of parameters" it's not that simple? It seems I need to transform mu first? I don't understand this section:

I don't understand what's going on in the ... compute lines.
Or is this a case where we know the Jacobian analytically? (Jacobians are new to me.)

Edit: Also - does the fact that I was using the Cholesky version of multi_normal change things here?

Hey! No worries… yeah… Jacobians. It took me a while to understand what that's all about. And I probably don't understand everything there is to it, yet…

So, my intuition about the Jacobian is the following (veeeery loosely speaking): if you transform a (random) variable, you have either a linear or a non-linear transformation. In essence, a linear transformation keeps the spacing in the vicinity of points equal, i.e. everything is stretched out (or squeezed) and shifted the same way. When you integrate, this doesn't really matter - sure, you have some stretching or squishing, but in the end it's gotta integrate to 1 (probabilities!), so as long as everything stays proportionally equal to everything else, you're fine and no correction is needed. Non-linear transformations don't have that property… that's why they're non-linear… so in some places you have more stretching (or squishing, or both!) than at other points. This is what you have to correct for.

Matrices can be understood as linear transformations, and their determinant tells you how much space is stretched (or squished) by that transformation. The Jacobian is the matrix of partial derivatives of a multivariate function. It basically tells you where everything moves (marginally) under the transformation. The determinant of the Jacobian thus tells you how much space is stretched and squished at every point. (Did I mention that I really like 3Blue1Brown… haha)

In your case you have a one-to-one change of your 9 (?) outcome variables: the transformation of one variable doesn't affect the change in the others. The Jacobian is thus a diagonal matrix with the partial derivatives of the transformation. The determinant of a diagonal matrix is just the product of its diagonal entries, and taking logs (of the absolute values) turns the product into a sum. So you can just happily add the log of the absolute value of each partial derivative to the target in Stan.

Phew. I really hope I didn't mess anything up… haha

And yes, you can still use the Cholesky version.


Really informative post and great videos, thank you - I'm a lot clearer on it now. One thing you said that I didn't quite understand:

What do you mean by a one-to-one change? And why does the transformation not affect the other variables - simply because it is just a log and not some function of several of them, e.g. fn ~ 2*y1 + 0.5*y2 + …?

EDIT: I'm also kinda learning this right now – this is also why I took my time with this post. Please don't treat this post as authoritative. If all is correct and you learn something… all the better! :)

Yes, that's what I meant: the function you transform y_1 with (let's say f_1(\cdot)=\log(\cdot)) only transforms y_1, and the function you transform y_2 with, f_2(\cdot)=\log(\cdot), only transforms y_2. (That notation is a bit clunky, but I think you get the point.) Another way to think about this would be the following (let each of y_1, \ldots, y_k be a vector with N observations):

f_1(y_1,y_2,\ldots,y_k) =\log y_1 \\ f_2(y_1,y_2,\ldots,y_k) =\log y_2 \\ \vdots \\ f_k(y_1,y_2,\ldots,y_k) =\log y_k \\

with partial derivatives:

\dfrac{\partial f_1(y_1,y_2,\ldots,y_k)}{\partial y_1} =1/y_1 \\ \vdots \\ \dfrac{\partial f_k(y_1,y_2,\ldots,y_k)}{\partial y_k} =1/ y_k \\

But these are not all the partial derivatives… think of \partial f_1/\partial y_2 and so on. However, all those others are kinda trivial… they are just 0. So the Jacobian looks like this:

\begin{bmatrix} \frac{\partial f_1}{\partial y_1} & \frac{\partial f_1}{\partial y_2} & \cdots & \frac{\partial f_1}{\partial y_k} \\ \frac{\partial f_2}{\partial y_1} & \frac{\partial f_2}{\partial y_2} & \cdots & \frac{\partial f_2}{\partial y_k} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_k}{\partial y_1} & \frac{\partial f_k}{\partial y_2} & \cdots & \frac{\partial f_k}{\partial y_k} \\ \end{bmatrix} = \begin{bmatrix} 1/y_1 & 0 & \cdots & 0 \\ 0 & 1/y_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/y_k \\ \end{bmatrix}

and the determinant of this is just 1/y_1 \times 1/y_2 \times \ldots \times 1/y_k. (Since we know that all y are positive, we don't need to take the absolute value here.) Now we only have to take the log of this determinant: \log(1/y_1) + \log(1/y_2) + \dots + \log(1/y_k). This can be rewritten as:

-\log(y_1)-\log(y_2)-\ldots-\log(y_k)

and in Stan (with vector[N] log_y[K])

for (k in 1:K)
    target += -log_y[k];

(I wonder if you could do for (k in 1:K) target -= log_y[k];, looks weird… haha.)

And as for the one-to-one transformation… it's one input and one output, as opposed to… let's say the mean or sum as a many-to-one transformation. One-to-many transformations… err, maybe sqrt? But, yeah, that's the point…

All of the above you can find in this pretty neat paper (it's more like a look-up thing to me). The context is different (deep learning) but it covers Jacobians and one-to-one, many-to-one, etc. transformations nicely (IMO).


Again - really informative post, thank you. So I had a go at implementing a lognormal. I generated some data as follows: gen_logmvn_y2_x3.R (888 Bytes)
I modified my code above to look like this:

model{
    //Priors
    L_Omega ~ lkj_corr_cholesky(1);
    L_sigma ~ normal(0, 5);
    
    alpha ~ normal(0, 3);
    to_vector(beta) ~ normal(0, 2);

    //likelihood
    log(y) ~ multi_normal_cholesky(mu, L_Sigma);
    for (dv in 1:NvarsY) 
        target += -log(y[dv]);
}

Ran it and extracted as follows:

fit1 <- stan(file="Stan models/linear_mvlognorm.stan",
            data = datalist, chains = 4,
            seed=40, refresh=25, iter = 2000)

params = extract(fit1)
#Extract intercepts
> sapply(1:datalist$NvarsY, function(y) mean(params$alpha[, y]) )
[1] 19.98743 20.00082

#Extract betas
> sapply(1:datalist$NvarsY, function(x) sapply(1:datalist$NvarsX, function(y) mean(params$beta[, x, y])) )
          [,1]      [,2]
[1,] 0.2957225 0.5843032
[2,] 0.3633446 0.6990684
[3,] 0.1943753 0.2902356

The true coefficient values used to simulate the data were:

> sigma[3:5,1:2]
    y1  y2
x1 0.3 0.6
x2 0.4 0.7
x3 0.2 0.3

So I guess that seems pretty good, although the betas are a little off. Funnily enough - I tested it with the target += statement commented out and it didn't seem to make much difference.

Edit: One other thing I'm wondering - I don't understand the target += thing very well. But log(y[dv]) should be a vector of length N. How does target += know to add log(y[dv]) to the correct vector in the array of outcome vectors? My y variable is declared as: vector[NvarsY] y [N];

So, I was thinking about how to extend this to include a hurdle model. The univariate Poisson hurdle model example in the manual makes use of an if statement:

if (y[n] == 0)
    target += log(theta);
else
    target += log1m(theta) + poisson_lpmf(y[n] | lambda)
                              - poisson_lccdf(0 | lambda);

So in the multivariate case this is not going to be so simple, because the vector[NvarsY] y for a given observation might have some y's with value 0 and others with value > 0. I was thinking then that I could take inspiration from the multivariate missing values section in the manual (section 4.5) and, through a combination of a for loop, counts of zeros and indexing, fit a log-MVN of varying dimension on each iteration together with a corresponding hurdle component. However, I have run into a snag. In the multivariate missing values section the example uses a multi_normal distribution, not multi_normal_cholesky as I am using. So in the example there is no problem with slicing up the Sigma matrix like this:

y[ns12] ~ multi_normal(mu, Sigma);
y[ns1] ~ normal(mu[1], sqrt(Sigma[1, 1]));
y[ns2] ~ normal(mu[2], sqrt(Sigma[2, 2]));

But can I do that with L_Sigma? (I suspect not.) Is there a way around this without resorting to changing everything to a multi_normal distribution?
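The only workaround I can think of right now (completely untested) would be to rebuild the dense Sigma from its Cholesky factor inside the model block - using the same function as in my generated quantities block - and then slice that just like the manual example does. With ns12, ns1, ns2 and mu set up as in the manual's missing-values example, that would look something like:

model{
    // ns12, ns1, ns2 and mu as in the manual's missing-values example; the only
    // change is rebuilding the dense covariance, since L_Sigma * L_Sigma' = Sigma
    matrix[NvarsY, NvarsY] Sigma = multiply_lower_tri_self_transpose(L_Sigma);

    y[ns12] ~ multi_normal(mu, Sigma);
    y[ns1] ~ normal(mu[1], sqrt(Sigma[1, 1]));
    y[ns2] ~ normal(mu[2], sqrt(Sigma[2, 2]));
}

I'm not sure how much of the speed advantage of the Cholesky parameterisation survives that, though.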