Weighted logistic regression

I’d like to implement a logistic regression model (with a normal prior) that accepts inputs and corresponding non-negative weights w_n (i.e. multiplicities of each point’s log-likelihood). Is the following implementation correct?

weighted_logistic_code = """
data {
  int<lower=0> N; // number of observations
  int<lower=0> d; // dimensionality of x
  matrix[N,d] x; // inputs
  int<lower=0,upper=1> y[N]; // outputs in {0, 1}
  vector<lower=0>[N] w; // non-negative weights
}
parameters {
  real theta0; // intercept
  vector[d] theta; // regression coefficients
}
model {
  theta0 ~ normal(0, 1);
  theta ~ normal(0, 1);
  for(n in 1:N){
    target += w[n]*bernoulli_logit_lpmf(y[n]| theta0 + x[n]*theta);
  }
}
"""

As I understand it, the point estimates will be correct, but the standard errors will be wrong (much smaller than the true values).

I’m trying to approximate a larger dataset, so inflating the likelihoods is part of my solution.

@lauren knows a lot about complex surveys and might be able to chime in.

I think your solution is correct if indeed you are just doing this to avoid re-calculating the log likelihood for non-unique rows in your data set.

You can check this by fitting the two versions of your model (with and without weighting) on a smaller data set and comparing the estimated parameters.
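Before fitting anything, the identity behind the weighting can also be checked numerically: for integer weights, the weighted sum of Bernoulli log-likelihood terms should equal the plain sum over a data set in which row n is repeated w_n times. Here is a minimal Python sketch of that check. The data are made up, and the helper names (including the plain-Python re-implementation of bernoulli_logit_lpmf) are my own, not the Stan functions themselves.

```python
import math

def bernoulli_logit_lpmf(y, alpha):
    """Log of Bernoulli(y | inv_logit(alpha)); plain-Python stand-in for the Stan function."""
    z = alpha if y == 1 else -alpha      # log p(y) = log(inv_logit(z))
    if z >= 0:
        return -math.log1p(math.exp(-z))  # stable for large positive z
    return z - math.log1p(math.exp(z))    # stable for large negative z

def linear_predictor(theta0, theta, x_row):
    """theta0 + x_row * theta for one row of inputs."""
    return theta0 + sum(t * xi for t, xi in zip(theta, x_row))

def weighted_loglik(theta0, theta, x, y, w):
    """Each row contributes w[n] times its log-likelihood (as in the Stan model above)."""
    return sum(
        wn * bernoulli_logit_lpmf(yn, linear_predictor(theta0, theta, xn))
        for xn, yn, wn in zip(x, y, w)
    )

def expanded_loglik(theta0, theta, x, y, w):
    """Unweighted log-likelihood with row n physically repeated w[n] times."""
    total = 0.0
    for xn, yn, wn in zip(x, y, w):
        for _ in range(wn):
            total += bernoulli_logit_lpmf(yn, linear_predictor(theta0, theta, xn))
    return total

# Made-up example data with integer weights (assumption for illustration only).
x = [[0.5, -1.0], [1.5, 2.0], [-0.3, 0.8]]
y = [1, 0, 1]
w = [3, 1, 2]
theta0, theta = 0.2, [0.7, -0.4]

lw = weighted_loglik(theta0, theta, x, y, w)
le = expanded_loglik(theta0, theta, x, y, w)
print("weighted:", lw, "expanded:", le)
```

The two values should agree up to floating-point error for any parameter values, which is why the weighted and unweighted fits should give the same posterior when the weights are true multiplicities.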

If you just want to reduce the number of calls to the likelihood, using sufficient statistics is a different and probably better way to go.

If you have the following data:

data {
  int<lower=0> N_unique;       // number of unique rows in x
  int<lower=0> d;              // dimensionality of x
  matrix[N_unique, d] x;       // unique input rows
  int<lower=0> U[N_unique];    // number of cases in each row of x
  int<lower=0> Y[N_unique];    // number of cases in each row of x with value 1
}

You should be able to use the binomial distribution for your likelihood:

model {
  theta0 ~ normal(0, 1);
  theta ~ normal(0, 1);
  target += binomial_logit_lpmf(Y | U, theta0 + x*theta);
}

No need for a loop here, because binomial_logit_lpmf is vectorized. Here is the Stan documentation for the binomial_logit_lpmf: https://mc-stan.org/docs/2_22/functions-reference/binomial-distribution-logit-parameterization.html.

Also check the Stan documentation for something like “exploiting sufficient statistics”.
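To get from the raw data to N_unique, x, U and Y, the rows can be grouped by their input values. Below is a rough Python sketch with made-up data; the helper names (aggregate, log_inv_logit, and the plain-Python binomial_logit_lpmf) are hypothetical and only for illustration. It also checks that the binomial log-likelihood differs from the sum of the Bernoulli terms only by the log binomial coefficients, which do not depend on the parameters and so do not change the posterior.

```python
import math

def aggregate(x, y):
    """Group identical input rows; return unique rows, case counts U, and success counts Y."""
    counts = {}  # dicts keep insertion order, so unique rows appear in first-seen order
    for row, yn in zip(x, y):
        key = tuple(row)  # note: exact equality on floats; round first if inputs are noisy
        u, s = counts.get(key, (0, 0))
        counts[key] = (u + 1, s + yn)
    x_unique = [list(k) for k in counts]
    U = [u for u, _ in counts.values()]
    Y = [s for _, s in counts.values()]
    return x_unique, U, Y

def log_inv_logit(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def binomial_logit_lpmf(k, n, alpha):
    """Log of Binomial(k | n, inv_logit(alpha)); plain-Python stand-in for the Stan function."""
    return (math.log(math.comb(n, k))
            + k * log_inv_logit(alpha)
            + (n - k) * log_inv_logit(-alpha))

def eta(theta0, theta, row):
    """Linear predictor theta0 + row * theta."""
    return theta0 + sum(t * xi for t, xi in zip(theta, row))

# Made-up raw data with repeated rows (assumption for illustration only).
x = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [1, 0, 1, 1, 1]
theta0, theta = -0.3, [0.9, 0.4]

x_unique, U, Y = aggregate(x, y)

# Sum of Bernoulli terms over the raw rows.
bern = sum(
    log_inv_logit(eta(theta0, theta, r) if yn == 1 else -eta(theta0, theta, r))
    for r, yn in zip(x, y)
)
# Binomial terms over the unique rows, and the parameter-free constant separating the two.
binom = sum(binomial_logit_lpmf(k, n, eta(theta0, theta, r))
            for r, n, k in zip(x_unique, U, Y))
const = sum(math.log(math.comb(n, k)) for n, k in zip(U, Y))

print(x_unique, U, Y)
print(abs((binom - const) - bern) < 1e-9)
```

In this sketch N_unique is simply len(x_unique), and x_unique, U and Y would be passed to Stan in place of the raw x and y.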
