Why is it so hard to fit a hierarchical regression? The variance shrinks to 0 until convergence fails

I have a hierarchical linear regression that works as shown below.

In this model, when p_target is << 0.5 the information available for estimating the proportions decreases, so the hierarchical regression on the proportions stops converging, because the variance of that regression collapses towards 0.

If I put a lower bound on the variance, e.g. 0.1, convergence is reached (but that is not very generative).

What should I do?

data {
	int G;                                     // Number of marker genes
	int P;                                     // Number of cell types
	int S;                                     // Number of mix samples
	int<lower=1> R;                            // Number of covariates (e.g., treatments)
	matrix[S,R] X;                             // Covariates for the hierarchical regression
	matrix<lower=0>[S,G] y;                    // Observed counts
	matrix<lower=0>[G,P] x;                    // Signature counts
	vector<lower=0, upper=1>[S] p_target;      // Proportion of the whole target -> simplex beta * p_target
	matrix[S,G] y_hat_background;              // Background matrix -> summed with the target matrix
}
transformed data {
	matrix<lower=0>[S,G] y_log;                // Log transformation of the data
	y_log = log(y + 1);
}
parameters {
	simplex[P] beta[S];                        // Proportions of the signatures in the mix
	real<lower=0> sigma;                       // Variance of the linear model
	matrix[R,P] alpha;                         // Coefficients of the hierarchical regression
	vector<lower=0>[P] phi;                    // Variance of the hierarchical regression
}
transformed parameters {
	matrix[S,P] beta_target;                   // Simplex scaled by the whole proportion of the target
	matrix[S,G] y_hat_target;                  // Predicted counts for the linear model
	matrix[S,G] y_hat;

	for (s in 1:S) beta_target[s] = to_row_vector(beta[s]) * p_target[s];
	y_hat_target = beta_target * x';
	y_hat = y_hat_target + y_hat_background;
}
model {
	matrix[S,P] beta_hat;                      // Predicted proportions in the hierarchical linear model

	sigma ~ normal(0, 0.1);
	to_vector(y_log) ~ student_t(8, log(to_vector(y_hat_target)), sigma); // Linear system

	// p_target is < 0.5, usually around 0.01, so I use a log link instead of inv_logit

	to_vector(alpha) ~ normal(0, 0.2);
	phi ~ cauchy(0, 2);

	//////////////////////////////////////////////////////////////////////////
	// WHEN p_target is small (so the proportions get really small compared
	// to the background), adding this regression causes lack of convergence.
	// If I take it out, the proportions are estimated and the model converges.
	//////////////////////////////////////////////////////////////////////////

	beta_hat = exp(X * alpha);

	for (s in 1:S) beta[s] ~ normal(beta_hat[s], phi);

	//////////////////////////////////////////////////////////////////////////
	// I also tried treating this as lognormal, with no improvement:
	//////////////////////////////////////////////////////////////////////////

	// beta_hat = X * alpha;
	// for (s in 1:S) beta[s] ~ lognormal(beta_hat[s], phi);
}


But if I forcibly add 0.1 to the phi variance, I get convergence:

beta[s] ~ normal(beta_hat[s], phi + 0.1); 
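Another way to impose the same kind of hard floor, for what it's worth, is as a declaration constraint on phi rather than an offset in the sampling statement; a sketch, where 0.1 stays the same arbitrary threshold:

parameters {
	vector<lower=0.1>[P] phi;                  // hard floor declared on the parameter itself
}
model {
	for (s in 1:S) beta[s] ~ normal(beta_hat[s], phi);   // no offset needed
}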


Here is the design matrix for the hierarchical regression:

> my_design
   (Intercept)       cov
1            1 -1.207436
2            1 -1.207436
3            1 -1.207436
4            1 -1.207436
5            1 -1.207436
6            1 -1.207436
7            1 -1.207436
8            1 -1.207436
9            1 -1.207436
10           1 -1.207436
11           1 -1.207436
12           1 -1.207436
13           1 -1.207436
14           1 -1.207436
15           1 -1.207436
16           1 -1.207436
17           1 -1.207436
18           1 -1.207436
19           1 -1.207436
20           1 -1.207436
21           1 -1.207436
22           1  1.207436
23           1  1.207436
24           1  1.207436
25           1  1.207436
26           1  1.207436
27           1  1.207436
28           1  1.207436
29           1  1.207436
30           1  1.207436
31           1  1.207436
32           1  1.207436
33           1  1.207436
34           1  1.207436
35           1  1.207436
36           1  1.207436
37           1  1.207436
38           1  1.207436
39           1  1.207436
40           1  1.207436
41           1  1.207436
42           1  1.207436
43           1  1.207436
44           1  1.207436
45           1  1.207436
46           1  1.207436
47           1  1.207436
48           1  1.207436
49           1  1.207436
50           1  1.207436
51           1  1.207436
52           1  1.207436
53           1  1.207436
54           1  1.207436
55           1  1.207436
56           1  1.207436
57           1  1.207436
58           1  0.000000
59           1  0.000000
60           1  0.000000
61           1  0.000000
62           1  0.000000
63           1  0.000000
64           1  0.000000
65           1  0.000000
66           1  0.000000
67           1  0.000000
68           1  0.000000
69           1  0.000000
70           1  0.000000
71           1  0.000000
72           1  0.000000
73           1  0.000000
74           1  0.000000
75           1  0.000000
76           1  0.000000
77           1  0.000000
78           1  0.000000
79           1  0.000000
80           1  0.000000
attr(,"assign")
[1] 0 1

If the estimated hierarchical variance goes to zero, the simplest thing to do is just implement complete pooling directly. If you think the effect really is small, but nonzero, you could try a zero-avoiding prior like a lognormal or gamma (the density goes to zero as the value goes to zero), but you should carefully inspect where the probability mass is piling up. Presumably with a lower bound of 0.1, the mass is piling up just above 0.1; that's a sign that the data is consistent with a lower value that you're only ruling out with the constraint.
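For example, a zero-avoiding prior on the hierarchical scale might look like the sketch below; the lognormal and gamma hyperparameters are placeholders to illustrate the shape, not recommendations:

parameters {
	vector<lower=0>[P] phi;                    // hierarchical scale, no hard floor
}
model {
	// zero-avoiding prior: the density vanishes as phi -> 0, so the scale
	// can become small but cannot collapse onto exactly 0
	phi ~ lognormal(log(0.2), 0.5);            // placeholder location and scale
	// or: phi ~ gamma(2, 10);                 // any shape > 1 also has zero density at 0
}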

Thanks Bob,

Can you explain in other words? Do you mean having one phi instead of many?

I will.

I am pretty sure 0.1 is too conservative, and indeed I tried gradually lowering it. For example, at 0.05 it converges but with some divergent transitions. However, I stopped because putting a hard threshold did not feel "generative" at all, and I saw that with higher thresholds all the slopes of the regression converge to 0.

Do you mean to look at the mode of the prediction, and check whether it sits slightly above 0.1?

However, I would like a generative model that works without these tricks.

Would it be a meaningful approach to:

  1. calculate the proportions in one model
  2. run a separate regression on the estimated proportions
  3. predict the distribution of those phi
  4. give that distribution as a prior for phi in the hierarchical integrated model, instead of a 0-centered distribution (see the sketch below)

(?)
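To make step 4 concrete, the idea would roughly be to pass the externally estimated distribution in as data and centre the prior for phi on it; a sketch, where phi_prior_mean and phi_prior_sd are hypothetical names for the output of step 3:

data {
	vector<lower=0>[P] phi_prior_mean;         // from the separate regression (step 3)
	vector<lower=0>[P] phi_prior_sd;
}
parameters {
	vector<lower=0>[P] phi;
}
model {
	// prior centred on the externally estimated scales instead of on 0
	phi ~ normal(phi_prior_mean, phi_prior_sd);
}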

Thanks

By complete pooling, I mean the usual thing. You could look at my case study on repeated binary trials, which directly compares no pooling, complete pooling, and partial pooling.
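In this model, one reading of complete pooling would be to drop the per-sample deviations entirely and let the regression determine the proportions. A sketch, where the softmax normalisation is an assumed way to map the linear predictor back onto a simplex (the original model uses an exp link):

transformed parameters {
	matrix[S,P] beta_pooled;                   // proportions fully determined by the regression
	for (s in 1:S)
		beta_pooled[s] = to_row_vector(softmax(to_vector(X[s] * alpha)));
	// beta_target[s] would then be beta_pooled[s] * p_target[s], with no phi at all
}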

No, modes are meaningless. I mean concentrate on where the probability mass is. Mass is volume times density—density by itself is meaningless and subject to reparameterization.

I’m guessing what happens is that cases that give Stan problems are cases where the model doesn’t match the data well. To test this, generate data from the model and see how well that works.
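A minimal sketch of the kind of fake-data check being suggested: a standalone Stan program (run with algorithm=fixed_param) that draws coefficients from the priors and counts from the likelihood as written, so the fit can be compared against known parameters. phi_sim, sigma_sim, and the 1e-6 floor are placeholder assumptions, and the renormalisation step is only an approximation of the simplex constraint:

data {
	int G;
	int P;
	int S;
	int<lower=1> R;
	matrix[S,R] X;
	matrix<lower=0>[G,P] x;                    // signature counts
	vector<lower=0, upper=1>[S] p_target;
}
generated quantities {
	matrix[R,P] alpha_sim;                     // simulated regression coefficients
	matrix[S,G] y_sim;                         // simulated observed counts
	real sigma_sim = 0.05;                     // placeholder observation scale
	real phi_sim = 0.05;                       // placeholder hierarchical scale

	for (r in 1:R)
		for (p in 1:P)
			alpha_sim[r,p] = normal_rng(0, 0.2);

	{
		matrix[S,P] beta_hat_sim = exp(X * alpha_sim);
		for (s in 1:S) {
			row_vector[P] b;
			row_vector[G] mu;
			for (p in 1:P)
				b[p] = fmax(1e-6, normal_rng(beta_hat_sim[s,p], phi_sim));
			b = b / sum(b);                    // approximate the simplex constraint
			mu = (b * p_target[s]) * x';       // y_hat_target for this sample
			for (g in 1:G)                     // invert y_log = log(y + 1), keep counts nonnegative
				y_sim[s,g] = fmax(0, exp(student_t_rng(8, log(mu[g]), sigma_sim)) - 1);
		}
	}
}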