Log1p bug

Hey all,

I just wanted to bring to your attention that log1p appears to be bugged. We were getting extremely biased parameter recovery when one of our likelihood model inputs was calculated with this function. This is how the inputs were formulated in our code:

(-1/2)*log1p(p.^2)
sqrt(log1p(p.^2))

We tried many different tests and fixes of different parts of the code, including priors, hyperpriors, log trapping, and asymmetric overflow/underflow, before homing in on this being the problem. Once we reformulated the expressions as shown below, the bias in the parameter recovery disappeared:

log(1/sqrt(p.^2+1))
sqrt(log(p.^2+1))
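
On paper the two forms are the same thing (just restating the intended algebra here, with p standing for the parameter in question), which is why we assumed the problem had to be in log1p itself:

log(1/sqrt(p.^2+1)) = -(1/2)*log(p.^2+1), i.e. -(1/2)*log1p(p.^2)
sqrt(log(p.^2+1))   = sqrt(log1p(p.^2))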

I attempted to recreate the bias by fitting the parameters with the following intentionally misspecified code:

log10(1/sqrt(p.^2+1))
sqrt(log10(p.^2+1))

I also attempted:

log(1/sqrt(p.^2))
sqrt(log(p.^2))

Neither one recreated the bias, so we’re not sure how log1p might have been breaking. I just wanted to bring this to your attention.

Sincerely,
Corey

Edit: Apologies. I nearly neglected to give some very important information. Our cluster uses cmdstan-2.30.1-mpi.

Thanks for this report. Are you able to give the values of p at which you saw the bias?

Hey spinkney,

p in this case is one of the parameters, and not including extreme draws, the mean draw on p is 1.3052 with a standard deviation of 1.8076. In our parameter recovery for p itself, we got a very noisy, unbiased estimate. However, in another one of our parameters, the scaling of which depends on the above calculations, we saw a significant bias toward underestimating the parameters relative to the ones we used to simulate the data. This was most clearly demonstrated when we fixed all of the parameters except for the biased one:

For this parameter recovery, p was fixed at a value of 1.6487.

Sincerely,
Corey

I might be missing something, but if you still see the unexpected behavior when p is fixed, then we don’t even need to worry about bugs in the derivatives, just in the value itself. But there doesn’t seem to be a bug here:

data {
  real p;
}
model {
  print(log1p(p^2));
}
library(cmdstanr)
log1p <- cmdstan_model("/Users/Jacob/Desktop/log1p.stan")
log1p$sample(data = list(p = 1.6487), fixed_param = T, 
             iter_warmup = 1, iter_sampling = 1,
             chains = 1)
log(1 + 1.6487^2)

Note that the guts of this function are super simple.

I don’t have any theory as to why it should have caused a problem, but it was immediately resolved by making the noted changes in the code above. I don’t know if it’s relevant, but in our code p is a vector of reals, not a single real. So when I say “fixed” I mean that every entry in p has the same value.

Here’s a version with vector p that relies on the derivatives being correct, testing p in the vicinity of 1.6487.

data {
  int<lower = 1> N;
  vector[N] y;
  real<lower=0> sigma;
}
parameters {
  vector<lower = 0>[N] p; // constraint to avoid nonidentifiability from negative p
}
model {
  y ~ normal(log1p(p .^ 2), sigma);
}
library(cmdstanr)

N <- 10
p <- rep(1.6487, N)
sigma <- .01
y <- rnorm(N, log(1 + p^2), sigma)

log1p <- cmdstan_model("/Users/Jacob/Desktop/log1p.stan")

output <- log1p$sample(
  data = list(
    N = N,
    y = y,
    sigma = sigma
    )
  )
output

Are you by any chance able to provide a reproducible example of a model where making the change you indicate rescues incorrect inference?

Sure, I actually just posted it on the modeling forum, but I can post it here for you. I have simulated data in a JSON file if you need it. However, the parameter recovery program is built in MATLAB.

functions {

	//custom function for lognormal inverse CDF
	matrix logncdfinv(vector x, vector m, vector s) {							

		int szx;
		int szm;
		szx = size(x);
		szm = size(m);	
		matrix[szx,szm] y;
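		// The next line is the lognormal quantile exp(m + s * Phi_inv(x)), written
		// with Phi_inv(x) = sqrt(2) * inv_erfc(2 - 2*x); the rep_matrix calls
		// broadcast so that y[i,j] is the lognormal(m[j], s[j]) quantile at x[i].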
		y = exp( rep_matrix(s',szx) * sqrt(2) .* rep_matrix(inv_erfc( -2 .* x + 2 ),szm) + rep_matrix(m',szx));
		return y;

		}

	//likelihood model for CASANDRE				
	real casandre_log(matrix resps, real guess, vector sds, vector se, real conf, int nt, int sampn) {			

		matrix[nt,4] llhC;

		matrix[nt,sampn] avgs;
		avgs = rep_matrix(se,sampn).*rep_matrix(sds',nt);
		
		for (tr in 1:nt){

			matrix[3,sampn] raws;

			for (rws in 1:sampn) {
				raws[1,rws] = normal_cdf(-conf,avgs[tr,rws],sds[rws]);
				}
			for (rws in 1:sampn) {
				raws[2,rws] = normal_cdf(0,avgs[tr,rws],sds[rws]);
				}
			for (rws in 1:sampn) {
				raws[3,rws] = normal_cdf(conf,avgs[tr,rws],sds[rws]);
				}

			vector[3] ratiodist;

			ratiodist[1] = mean(raws[1,:]);
			ratiodist[2] = mean(raws[2,:]);
			ratiodist[3] = mean(raws[3,:]);

			llhC[tr,1] = ((guess/4) + ((1-guess)*ratiodist[1]));
			llhC[tr,2] = ((guess/4) + ((1-guess)*(ratiodist[2]-ratiodist[1])));
			llhC[tr,3] = ((guess/4) + ((1-guess)*(ratiodist[3]-ratiodist[2])));
			llhC[tr,4] = ((guess/4) + ((1-guess)*(1-ratiodist[3])));

 		}

		real ll;
		ll = sum(columns_dot_product(resps, log(llhC)));
		return ll;
		
	}
		
}
data {

	int ns;
	int nt;
	int sampn;
	matrix[4,ns*nt] respslong;
	row_vector[ns*nt] orislong;
	vector[sampn] sampx;
	
}
transformed data {

	
	matrix[ns*nt,4] respstall;
	vector[ns*nt] oristall;
	int nth;

	respstall = respslong';
	oristall = orislong';

	array[ns] matrix[nt,4] resps;
	matrix[nt,ns] oris;

	nth = 1;

	for (sub in 1:ns) {

		resps[sub] = respstall[nth:(nth+nt-1),1:4];							

		oris[:,sub] = oristall[nth:(nth+nt-1)];

		nth = nth + nt;

	}

}
parameters {

	//Parameters
	vector<lower=0,upper=1>[ns] guess;
	vector<lower=0>[ns] sens;
	vector[ns] crit;
	vector<lower=0>[ns] meta;
	vector<lower=0>[ns] conf;

	//Hyperparameters
	real snm;
	real<lower=0> sns;
	real cm;
	real<lower=0> cs;
	real mm;
	real<lower=0> ms;
	real ccm;
	real<lower=0> ccs;

}
model {

	//Calculate local variables
	matrix[nt,ns] sm;
	vector[ns] sc;
	matrix[sampn,ns] xtrans;
	matrix[sampn,ns] sds;
	matrix[nt,ns] se;

	sm = oris.*rep_matrix(sens',nt);
	sc = crit.*sens;

	xtrans = logncdfinv(sampx,log(1/sqrt(meta.^2+1)),sqrt(log(meta.^2+1)));							

	sds = 1./xtrans;
	se = sm-rep_matrix(sc',nt);
	
	//Hyperpriors
	snm 	~	normal(0,1);
	sns 	~	lognormal(0,1);
	cm 	~	normal(0,1);
	cs 	~	lognormal(0,1);
	mm 	~	normal(0,1);
	ms 	~	lognormal(0,1);
	ccm 	~	normal(0,1);
	ccs 	~	lognormal(0,1);

	//Priors
	guess 	~	beta(1,193/3);
	sens	~	lognormal(snm,sns);
	crit	~	normal(cm,cs);
	meta  	~	lognormal(mm,ms);
	conf  	~	lognormal(ccm,ccs);

	//Loop through the participants
	for (i in 1:ns) {

		//Likelihood Model						
		resps[i] ~ casandre(guess[i],sds[:,i],se[:,i],conf[i],nt,sampn);		//likelihood model for this participant

	}

}

The change is in the line xtrans = logncdfinv(sampx,log(1/sqrt(meta.^2+1)),sqrt(log(meta.^2+1))); which used to read xtrans = logncdfinv(sampx,(-1/2)*log1p(p.^2),sqrt(log1p(p.^2))); where p is the meta parameter above.

What is the value of ns when you have problems?

That is the variable for the number of subjects, and we’ve seen this bias in the parameter recovery for all of our simulated data sets so far, where the number of subjects (and thus ns) has been 5, 10, or 25.

Can you try your model again with -1/2 written either as -0.5 or -1/2.0?

The -1/2 in (-1/2)*log1p(p.^2) is integer division, which rounds to 0, so the whole expression evaluates to 0.
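
Here’s a minimal standalone check (not part of the model above, just illustrating integer vs. real division; run it with fixed_param):

transformed data {
  // int / int is integer division in Stan, so -1/2 truncates to 0
  print("-1/2    = ", -1/2);      // prints 0
  print("-1/2.0  = ", -1/2.0);    // prints -0.5
  print("-0.5    = ", -0.5);      // prints -0.5
  // with p = 1.6487 the old location argument therefore collapses to 0
  print("(-1/2) * log1p(1.6487^2) = ", (-1/2) * log1p(1.6487^2));  // 0
  print("-0.5 * log1p(1.6487^2)   = ", -0.5 * log1p(1.6487^2));    // about -0.66
}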


Good catch @spinkney.

@Corey.Plate for more, see 2.1 Integer-valued arithmetic operators in the Stan Functions Reference.


Sure. I’ll set up a 25-subject run on our cluster to test this out and let you know what we get back!

Thanks for the link!

Writing the line as (-0.5)*log1p(p.^2) did indeed get rid of the bias in the parameter recovery!
