Measurement error in a predictor that is a proportion

Let’s say I have a binary outcome, numerous data rows per group at each of several timepoints, and a predictor p_prop that is the proportion of samples (samps) that are positive. I have no information at the samps level, just the number of samples taken for each value of p_prop in the row of data.
Differing numbers of samples are taken at each timepoint for each group. I am interested in how p_prop and timepoints predict the outcome.

I might run a model in brms like this:

m1 <- brm(outcome ~ 1 + timepoints + p_prop + (1|group), family=bernoulli, data=data)

However, because I have vastly differing numbers of samples for each timepoint in each group, I think I want some sort of measurement error model: a p_prop value based on 5 samples should have more error than one based on, say, 400 samples. Since I have the number of samples and the proportion, I could calculate the sd and use it in a measurement error model like this:

m1.me <- brm(outcome ~ 1 + timepoints + me(p_prop, p_prop_sd) + (1|group), family=bernoulli, data=data)

But I’m not sure this is what I want (and brms throws an error), because I have a lot of p_prop values that are 0 or 1, and the sd of a proportion that is 0 or 1 is 0. It seems that the uncertainty in the p_prop predictor should scale with the number of samples taken…right? How do I go about doing this? Maybe I’m not thinking about this the right way, but despite the formula for the sd of a proportion, in my scenario I trust a value from 5 samples much less than one from 400 samples, whatever the value of the proportion itself.
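To make the problem concrete, here is a quick sketch (in Python rather than R, just for illustration) of why the usual sd formula sqrt(p(1-p)/n) can't express distrust of small samples: it collapses to zero at p = 0 or 1 no matter how few samples were taken.

```python
import math

def prop_sd(p, n):
    """Standard error of a sample proportion: sqrt(p*(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

# A proportion of 0.5 behaves as expected: more samples, less error.
print(prop_sd(0.5, 5))    # ~0.224
print(prop_sd(0.5, 400))  # ~0.025

# But at p = 0 the formula returns 0 for any sample size,
# even though 0/5 is far less trustworthy than 0/400.
print(prop_sd(0.0, 5))    # 0.0
print(prop_sd(0.0, 400))  # 0.0
```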

Thanks!

So one thing I thought about doing was expanding the number of rows in the dataset by the number of samples taken for each group at each timepoint. For example, if group A at timepoint 1 had 10 samples, then I created 10 rows in the data that were all identical. If group B at timepoint 1 had 450 samples, then I created 450 rows in the data that were all identical. Of course the outcome variable, which is binary, would be 0 or 1 for every row in the expanded data. This changes the proportion of 0’s and 1’s in the expanded data compared to the original data, since the number of samples taken differs for every row in the original data.

The idea is that this would upweight the information from groups with more samples while using partial pooling among groups… I am still not sure this is a correct solution though…any time you simply duplicate rows in data, the SEs for the model coefficients go down… The outcome is not directly related to the number of samples taken; it’s related to p_prop, and it’s the accuracy of p_prop that is related to the number of samples taken.

I’ve done something like a measurement error model for proportional data. First, I just set anything 0 or 1 to .00001 and .99999, respectively. I know, not a great solution, but it’s practical.

Second, I modeled the proportions using a beta distribution parameterized by the mean and kappa (a scaling factor: the higher it is, the more concentrated the beta). So it looks something like:

observed_proportion ~ beta_proportion(true_proportion, observed_count + 1);

Then use “true_proportion” (a vector with the same length as observed_proportion) elsewhere in your model.
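For intuition on this parameterization: beta_proportion(mu, kappa) corresponds to a Beta(mu*kappa, (1-mu)*kappa) distribution, whose sd is sqrt(mu*(1-mu)/(kappa+1)). A quick Python sketch (illustrative numbers only) shows how setting kappa = count + 1 makes the uncertainty shrink as the count grows, for the same observed mean:

```python
import math

def beta_proportion_sd(mu, kappa):
    """Sd of a Beta(mu*kappa, (1-mu)*kappa) distribution:
    sqrt(mu*(1-mu)/(kappa+1))."""
    return math.sqrt(mu * (1 - mu) / (kappa + 1))

# Same observed mean, different counts: kappa = count + 1
print(beta_proportion_sd(0.3, 5 + 1))    # ~0.173
print(beta_proportion_sd(0.3, 400 + 1))  # ~0.023
```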

That sort of formulation can be used for some Kriging approaches too; not relevant to your question, but it’s certainly interesting (e.g., you can write a function so that the kappa parameter is high when some other inputs are near one another, like a transformation on radial basis functions).


Nice! Thanks for the help. I had tried adding a small amount to the 0 proportions and subtracting a small amount from the 1 proportions, then calculating the sd of the proportion using the sample size and using that sd in the measurement error model I specified above. However, the sd is still very low for small samples near zero and one.

I like your idea of modeling the observed proportions using a beta distribution… I’ll see if I can do something similar to this in brms using the missing data and non-linear model syntax.

I wasn’t able to figure out how to do this in brms, but I have made a simple sim and a simple model based on the initial pointers from @Stephen_Martin, and it seems to work as intended! Thanks @Stephen_Martin!
I checked the true_proportion sd’s from the model against the observed_counts (sorted by observed_count in the .csv sheet), and sure enough the sd’s generally increase as the observed_counts decrease. Nice!
Here is my code for sim and simple model:

#define true proportions and generate observed proportions from the true, using beta distribution scaled by the observed_count
n <- 100
observed_count <- rnbinom(n, mu=50, size=1) + 10
true_proportion <- runif(n, 0.0000001, 0.9999999)
mu <- true_proportion
shape1 <- mu * (observed_count + 1)
shape2 <- (1 - mu) * (observed_count + 1)
observed_proportion <- rbeta(n, shape1, shape2)
#clamp away from exactly 0 and 1 (exact == comparisons on floats are fragile,
#and the beta density rejects values at the bounds)
observed_proportion <- pmin(pmax(observed_proportion, 0.0001), 1 - 0.0001)

#generate outcome variable with the true proportion as the predictor
#(on the logit scale, to match the bernoulli_logit likelihood in the Stan model)
a_b <- 0.01
b_b <- 0.7
y_p <- plogis(a_b + true_proportion*b_b)  #inverse-logit
y <- rbinom(n, 1, y_p)

#data
d1 <- list(N = n,
           y = y, 
           observed_proportion = observed_proportion, 
           observed_count = observed_count)

#stan model
stan_code <- "
data {
  int<lower=0> N;
  vector[N] observed_count;
  vector[N] observed_proportion;
  int<lower=0, upper=1> y[N];
}
parameters {
  vector<lower=0, upper=1>[N] true_proportion;
  real alpha;
  real beta;
}
model {
  observed_proportion ~ beta_proportion(true_proportion, observed_count + 1);
  y ~ bernoulli_logit(alpha + beta * true_proportion);
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);
}
"

#fit model
library(rstan)
fit1 <- stan(model_code = stan_code, data = d1,
             chains = 4, iter = 2000, warmup = 1000, 
             thin = 1, cores = 4)

print(fit1)

And the check sheet.
count vs sd check.csv (3.8 KB)

I was thinking I should put a prior on the parameter true_proportion, but I am not sure how, since it is a vector…unfortunately, I am not as familiar with Stan code as I am with brms. The things I tried gave me various errors.

Hi, if you’re comfortable working with Stan directly, the implementation of a measurement error model for p_{prop} would be very straightforward. Given that you know the number of samples N and the observed proportion p_{prop} positive, the “true” proportion positive p_{true} can be assumed to follow a beta distribution with shape parameters \alpha = p_{prop} N + 1 and \beta = N(1 - p_{prop}) + 1 (i.e., positives + 1 and negatives + 1). This approach assumes a uniform prior for the unknown parameter p_{true}. You can then use p_{true} as a predictor of your outcome of interest.
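This posterior has exactly the behavior asked for above: even when zero positives are observed, its sd is nonzero and shrinks with N. A quick Python sketch (illustrative numbers only), using the standard sd formula for a Beta(a, b) distribution:

```python
import math

def beta_sd(a, b):
    """Sd of a Beta(a, b) distribution."""
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

def posterior_sd(positives, n):
    """Sd of Beta(x+1, N-x+1): binomial data + uniform prior."""
    return beta_sd(positives + 1, n - positives + 1)

# Zero positives out of 5 vs. 400 samples: unlike sqrt(p*(1-p)/n),
# the posterior uncertainty is nonzero and shrinks with N.
print(posterior_sd(0, 5))    # ~0.124
print(posterior_sd(0, 400))  # ~0.0025
```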

I don’t know whether this is possible to implement via brms.


P.S.: I see that the use of a beta distribution was already suggested. There is no need, though, to put the bounds very close to zero and one; you should be fine putting them at zero and one directly.

Thanks! I think I implemented something similar with beta_proportion.

I’m not sure what you mean here. The Stan documentation says theta can’t be zero or one, so that is why I added a little bit to proportions that were zero and subtracted a little from those that were one.

Would you have any idea how to add a varying effect such as described in this post? In my real application, I have many measurements per grouping level, and I would like information about true_proportion to be correlated within groups.

If you set the bounds at zero and one, the NUTS sampler will sample between those bounds. You don’t have to add or subtract small values to the bounds.

I’ll reply over there when I have time. Better to keep topics and replies in the same place :).

1 Like

Is N here the total number of samples or the number of positive samples? Thanks!

N is the total number of samples.

If the data include 0 and 1, then it will throw errors.

beta(0 | a, b) is 0, and log(0) is not defined.
Same with beta(1 | a, b) = 0: log(0) is not defined.

That’s why, when you have proportional data with 0s and 1s, you just add a small fudge factor.

I get what you’re saying, but it doesn’t apply in my view. Given a sample of N observations, of which x are positive, the posterior distribution of the unknown true prevalence p is p \sim Beta(x + 1, N - x + 1). The “+1” term reflects a uniform prior for the unknown true prevalence.

In other words, the Stan model doesn’t (shouldn’t) have anything like “beta(0 | a, b)” in it. The number of positives should be used as input for the shape parameter of the beta distribution (+1), and as a data point for which the density is calculated (as in your example).

Yes, it will have beta(0 | a, b), if your data has 0 in it.

Imagine you have N quantities, each one a proportion in (0, 1); some of these are exactly 0, some are exactly 1, the vast majority are in between 0 and 1.

If you feed 0 as data to a beta distribution, it does not matter what your alpha and beta parameters are: the pdf evaluates to 0, so the log pdf is log(0), which is undefined.
When you are sampling from a beta, you do not need to worry about it; but you 100% do need to either fudge the 0s and 1s or use a zero-one-inflated beta model.

My point exactly :). And I think only this point applies, because for the OP’s problem one only needs to sample from a beta distribution (with known shape parameters based on the number of positives and negatives in the data), relying on the conjugacy of the beta and binomial distributions. There is no need to evaluate the density of the observed proportions (which may indeed be zero), and therefore no need for fudging. Anyway, the OP has elaborated on the problem in another post (which they cite somewhere above), where it becomes clear that a hierarchical logistic regression model is actually more convenient (in my view).


In case it’s helpful, another way to restate @LucC’s approach is that the underlying likelihood here is binomial, which has no problem with literal zeros in the data. The beta distribution with the +1 terms is being used as something like a sufficient statistic. The true binomial proportion is assumed to never be a literal zero, which is why it’s unproblematic to assume that it arises from a beta distribution.


Thanks for reframing my point @jsocolar. In a Stan model, this would be like modelling the data with binomial likelihood with unknown probability p of success, which is specified as a parameter and is given a beta(1,1) prior. The analytical solution to that is drawing p from a beta(a+1, b+1) distribution, with a and b being the number of positive and negative observations.
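This conjugacy is easy to verify numerically. A quick Python sketch (made-up counts) grid-integrates the binomial likelihood against a uniform prior and compares the resulting posterior mean to the analytic Beta(a+1, b+1) mean:

```python
def grid_posterior_mean(a, b, grid=20_000):
    """Posterior mean of p under likelihood p^a * (1-p)^b
    and a uniform prior, via midpoint-rule grid integration."""
    ps = [(i + 0.5) / grid for i in range(grid)]
    w = [p ** a * (1 - p) ** b for p in ps]
    return sum(p * wi for p, wi in zip(ps, w)) / sum(w)

a, b = 3, 7  # 3 positives, 7 negatives (hypothetical counts)
print(grid_posterior_mean(a, b))  # ~0.3333
print((a + 1) / (a + b + 2))      # Beta(a+1, b+1) mean: ~0.3333
```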
