Bivariate model for a meta-analysis of diagnostic test accuracy

Hi,

I would like to fit a bivariate model for a meta-analysis of diagnostic test accuracy (sensitivity and specificity). I have approximately 50 studies to include, each with four cell counts (true positives, false positives, true negatives, false negatives).

In my code (attached below), I transformed the count data to the logit of the true positive rate and the false positive rate and calculated their standard errors. To fit a bivariate normal model for sensitivity and specificity, I wanted to use those standard errors in the covariance matrix for within-study variability, but the within-study correlation is unknown. So I wanted to estimate that correlation, treating the within-study variances as known, while also estimating a full covariance matrix for between-study variability.
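As a minimal sketch of this transformation (the counts below are hypothetical; the delta-method standard error of a logit simplifies to sqrt(1/successes + 1/failures)):

```python
import math

def logit_and_se(successes, failures):
    """Logit of a proportion with its delta-method standard error."""
    n = successes + failures
    p = successes / n
    logit_p = math.log(p / (1 - p))
    # 1 / (n * p * (1 - p)) simplifies to 1/successes + 1/failures
    se = math.sqrt(1 / (n * p * (1 - p)))
    return logit_p, se

# hypothetical study: TP = 28, FN = 9 (sensitivity side)
logit_tpr, se_tpr = logit_and_se(28, 9)
```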

As I got errors saying my covariance matrix was not symmetric, I tried a matrix decomposition, but I now get the following error. I think it occurs because I am trying to plug the estimated variances into the covariance matrix for within-study variability even though I declared it in the transformed parameters block, but I am not sure how to proceed.

Variable "cov_matrix" does not exist.
 error in 'model5e203298a5ca_sen_spe_uni' at line 63, column 14
  -------------------------------------------------
    61: 
    62: 
    63: 	cov_matrix[J] sigma_w_mat[K]; // vcov matrix for within study variability
                     ^
    64: 
  -------------------------------------------------

The data I am using for the above code is uploaded here:

My code is as follows.

Any help to fix the error, as well as any other comments on the modelling as a whole, would be greatly appreciated.

data {

	int<lower=1> K; /* Number of studies */
	int<lower=1> J; /* sensitivity and specificity = 2*/
	real<lower=0> TP[K];
	real<lower=0> FP[K];
	real<lower=0> FN[K];
	real<lower=0> TN[K];

}

transformed data{
	real<lower=0> sigma_tpr[K];
	real<lower=0> sigma_fpr[K];
	real TPR[K];
	real FPR[K];
	real LOGIT_tpr[K];
	real LOGIT_fpr[K];
	vector[J] LOGIT_pair[K];
	vector<lower=0>[J] sigma_w[K]; // within-study sd
	

	for (i in 1:K){
		TPR[i] = TP[i]/(TP[i] + FN[i]);
		FPR[i] = FP[i]/(TN[i] + FP[i]);
		
		LOGIT_tpr[i] = log(TPR[i]/(1-TPR[i]));
		sigma_tpr[i] = sqrt(1/((TP[i] + FN[i])*TPR[i]*(1-TPR[i])));
		
		LOGIT_fpr[i] = log(FPR[i]/(1-FPR[i]));
		sigma_fpr[i] = sqrt(1/((TN[i] + FP[i])*FPR[i]*(1-FPR[i])));

		LOGIT_pair[i,1] = LOGIT_tpr[i];
		LOGIT_pair[i,2] = LOGIT_fpr[i];

		sigma_w[i,1] = sigma_tpr[i]; // quad_form_diag expects standard deviations, not variances
		sigma_w[i,2] = sigma_fpr[i];
		
		
		}

}

parameters{
	vector[J] mu;
	vector[J] theta;
	vector<lower=0>[J] tau; // between-study sd

	corr_matrix[J] Omega_b; // correlation matrix for between study variability
	corr_matrix[J] Omega_w[K]; // correlation matrix for within study variability


}

transformed parameters {
	cov_matrix[J] sigma_b_mat; // vcov matrix for between study variability
	sigma_b_mat = quad_form_diag(Omega_b, tau);


	cov_matrix[J] sigma_w_mat[K]; // vcov matrix for within study variability

	for (i in 1:K){
		sigma_w_mat[i] = quad_form_diag(Omega_w[i], sigma_w[i]);
	}
	
}



model {
	
	theta ~ multi_normal(mu, sigma_b_mat);

	for (i in 1:K){
		LOGIT_pair[i] ~ multi_normal(theta, sigma_w_mat[i]);

	}
	

	// priors
	mu ~ normal(0,1);
	tau ~ cauchy(0, 0.5);
	Omega_w ~ lkj_corr(1); //LKJ prior on the correlation matrix
	Omega_b ~ lkj_corr(1); //LKJ prior on the correlation matrix
	

}

This might be an easier model to think about if you use a different representation of the data. Specifically, for a given study you can re-code the 2x2 table of counts into two vectors, both containing 0/1, where one reflects the outcome of the diagnostic test (D) and the other reflects the truth/gold standard (T). So, if you had a 2x2 table of:

TP: 1
FP: 2
TN: 3
FN: 4

Then you’d have vectors:

D T
1 1
1 0
1 0
0 0
0 0
0 0
0 1
0 1
0 1
0 1
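That recoding can be sketched as (a toy helper, not part of any package):

```python
def expand_2x2(tp, fp, tn, fn):
    """Expand 2x2 cell counts into per-individual (D, T) pairs,
    where D is the test result and T is the gold-standard truth."""
    rows  = [(1, 1)] * tp  # true positives:  test +, truth +
    rows += [(1, 0)] * fp  # false positives: test +, truth -
    rows += [(0, 0)] * tn  # true negatives:  test -, truth -
    rows += [(0, 1)] * fn  # false negatives: test -, truth +
    return rows

pairs = expand_2x2(tp=1, fp=2, tn=3, fn=4)
D = [d for d, t in pairs]
T = [t for d, t in pairs]
```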

Then, for a given study, and using R’s formula syntax, you have a generalized linear model:

study_fit = glm(
    data = study_data
    , formula = D ~ 1+T
    , family = binomial
)

Where, if you make T a factor and use sum contrasts, the intercept will reflect the bias of the diagnostic test, while the effect of T will reflect its sensitivity (in the signal detection theory sense; I hate how medical stats adopted the same term for a different quantity in the same realm).

From there, the formulation of a meta-analysis can be achieved by treating the different studies as “random effects”, so if you had the above data for all studies combined together (again, making T a factor with sum contrasts) with a third vector identifying the study, then using lme4/brms formula notation you’d do:

meta_fit = glmer( 
    data = study_data
    , formula = D ~ 1+T + ( 1+T | Study)
    , family = binomial
)

(Using glmer there only because I’m not confident in my brms)

In which case you’ve implemented a model where there’s an across-study mean bias, an across-study mean sensitivity, then each study gets its own regularized bias and sensitivity as correlated variates.

Finally, if you want to characterize the posterior’s implications for the TP/FP/TN/FN cell probabilities, these are deterministic transforms of the bias & sensitivity.
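For instance, assuming a logit link and +1/−1 sum contrasts for T (the exact coding depends on your contrast setup), the cell rates can be recovered like this (a sketch with made-up coefficients, not fitted values):

```python
import math

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def cell_rates(bias, effect):
    """Recover hit/false-alarm rates from the intercept (bias) and the
    effect of T, assuming a logit link and +1/-1 sum contrasts for T."""
    tpr = inv_logit(bias + effect)  # P(D = 1 | T = 1), i.e. sensitivity
    fpr = inv_logit(bias - effect)  # P(D = 1 | T = 0), i.e. 1 - specificity
    return tpr, fpr

tpr, fpr = cell_rates(bias=-0.5, effect=2.0)
```

The four cell probabilities then follow from tpr, fpr, and the disease prevalence in each study.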

Note that if you have lots of data and encounter slow sampling, there are tricks to speed up binomial outcomes by having the likelihood evaluated as counts, but I didn’t want to confuse things by starting with that.
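The equivalence behind that trick: summing the 0/1 outcomes into a binomial count changes the log-likelihood only by the constant log C(n, k), which does not affect inference on p. A quick check:

```python
import math

# Bernoulli log-likelihood of 0/1 outcomes vs. the binomial kernel on the
# aggregated count; they agree exactly (the full binomial log-likelihood
# would add only the constant log C(n, k)).
p = 0.7
outcomes = [1, 1, 1, 0, 0]
k, n = sum(outcomes), len(outcomes)

bernoulli_ll = sum(math.log(p if y == 1 else 1 - p) for y in outcomes)
binomial_kernel = k * math.log(p) + (n - k) * math.log(1 - p)
```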

Note that the above signal detection theoretic model implies a shared variance magnitude between the latent distributions; if that’s not appropriate then check out this blog series on more nuanced approaches.

Hi,

Thank you for your quick and detailed advice. I had no idea how to fit the model using generalised linear mixed modelling, so your advice is really helpful. One quick follow-up question: what are the “latent distributions” you mention here? I am not familiar with the signal detection theoretic model.

In the meantime, my original intent was to try a Bayesian approach to the bivariate model, so I would still appreciate any feedback on my code and/or formulation.

To be clear, I would like to try the models framed in this paper, and its SAS code is here.

It seems they did not estimate the within-study correlation, and I am not sure if it is sensible to estimate it, which is what I am trying to do.

Thank you!

Not sure you need to estimate the within-study correlation. I believe the only correlation parameter of interest in your research context is the between-study one, as noted here:

" As sensitivity and specificity are estimated from different samples in each study (diseased and non-diseased patients), they can be assumed to be independent so that the within-study correlations ô±żare set to zero. However, there may be a non-zero between-studies correlation ô±żô°ˆwhich should be accounted for."

Of note, we discuss the model you referenced in this post:

This model was applied to a context where there were multiple studies with 2 outcomes, both of which were estimated in the same population (in each study). In that case, both within- and between-study correlations play a role. I decided to assume a within-study correlation of 0.5, and incorporated this into the model through a variance-covariance matrix generated by metafor::vcor().
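For concreteness, one within-study block of such a variance-covariance matrix, built from known standard errors and an assumed correlation, might look like (illustrative values):

```python
def vcov_2x2(se1, se2, rho):
    """2x2 variance-covariance block from two standard errors and an
    assumed within-study correlation rho."""
    cov = rho * se1 * se2
    return [[se1 ** 2, cov],
            [cov, se2 ** 2]]

V = vcov_2x2(0.3, 0.4, 0.5)  # rho = 0.5 assumed, not estimated
```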

Hope this is helpful.

Oh, yes, and that’s the only correlation my glmer code estimates. I don’t think it even makes sense to speak of a within-study correlation unless you can find a way of pairing up individuals across treatment conditions, no?

Thank you very much for your comment!

I have figured this out, and this sort of Stan code works for me.


data {

	int K; /* Number of studies */
	int J; /* sensitivity and specificity = 2*/
	int TP[K];
	int FP[K];
	int FN[K];
	int TN[K];

}

transformed data {
	int Y_pos[K];
	int Y_neg[K];


	for(i in 1:K){
		Y_pos[i] = TP[i] + FN[i]; // num positive
		Y_neg[i] = TN[i] + FP[i]; // num negative
		}
}

parameters{
	vector[J] mu;
	vector[J] theta[K];
	vector<lower=0>[J] tau; // between-study sd
	cholesky_factor_corr[J] Lcorr;


}

transformed parameters{
	vector[J] P[K] = inv_logit(theta);
	
}

model {
	
	for (i in 1:K){
		TP[i] ~ binomial(Y_pos[i], P[i,1]);
		TN[i] ~ binomial(Y_neg[i], P[i,2]);
	}


	theta ~ multi_normal_cholesky(mu, diag_pre_multiply(tau, Lcorr));
	

	// priors
	mu ~ normal(0,2.5); // which can cover extreme values like logit(0.999)
	//tau ~ normal(0, 5);
	tau ~ uniform(0, 10);
	Lcorr ~ lkj_corr_cholesky(2);
}
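As a quick sanity check on the comment about the mu prior: logit(0.999) sits within about three standard deviations of a normal(0, 2.5) prior:

```python
import math

# logit(0.999) expressed in units of the normal(0, 2.5) prior sd
extreme = math.log(0.999 / 0.001)  # about 6.91
z = extreme / 2.5                  # about 2.76 prior sds from the mean
```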



Here is my data

> glmm_stan_data
     y   n outcome study
1   28  37      se     1
2   33  49      se     2
3   46  59      se     3
4   24  25      se     4
5   45  59      se     5
6   12  13      se     6
7   17  25      se     7
8   23  27      se     8
9   99 107      se     9
10  64  76      se    10
11  44  50      sp     1
12  55  65      sp     2
13  64  79      sp     3
14  21  33      sp     4
15  64  79      sp     5
16  14  18      sp     6
17  31  34      sp     7
18  34  35      sp     8
19 108 141      sp     9
20  78 101      sp    10

The corresponding Stan (brms) code is:

fit_brm <- brm(
    data = glmm_stan_data
    , formula = y | trials(n) ~ 0 + outcome + (0 + outcome | study)
    , family = binomial
)
The model works well. However, I don’t have much experience with brms, and my prior specification needs improvement.

I have a question.
I have 17 studies; however, some of them have a sample size of less than 30 (note that sample size = TP + FP + TN + FN).

How can I determine the minimum sample size?
I worry that if a study has a sample size of, say, 17, its estimate of sensitivity or specificity can be either overestimated or underestimated.
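As a rough illustration of that concern (not a sample-size rule), the standard error of a raw proportion shrinks only with sqrt(n), so per-study estimates at n around 17 are quite noisy; the hierarchical model's partial pooling is what keeps such studies from being taken at face value:

```python
import math

def prop_se(p, n):
    """Rough standard error of an estimated proportion."""
    return math.sqrt(p * (1 - p) / n)

# assumed true sensitivity of 0.8, two hypothetical sample sizes
se_n17 = prop_se(0.8, 17)    # about 0.097
se_n100 = prop_se(0.8, 100)  # about 0.040
```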