Using a bigger sample of X than I have labeled y

I hope this isn’t too off-topic.

I am regressing a continuous y on a multi-dimensional X, and I have far more observations of X than I have labeled y values. I’m wondering what approaches exist for making use of the extra unlabeled data. Unfortunately, I haven’t been able to formulate a Google query that helps: all I keep finding is about imputation for missing X, which is not my problem.

I’m pretty sure at least a few things can help me:

  1. Using the bigger sample to standardize X (already done)
  2. Estimating a better covariance matrix of X from the bigger sample (not 100% sure how best to use it)
  3. Perhaps correcting for sampling bias, since my labeled y are picked non-uniformly (they’re indeed kind of stratified)
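
To make the three ideas concrete, here is a minimal NumPy sketch on simulated data. Everything in it is an assumption for illustration: the names `X_all` and `labeled_idx`, the sizes, and especially the stratum definition (a split on the first column), which is not how my labels were actually picked.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 5000 rows of X, of which only 200 have a labeled y.
n_all, n_lab, d = 5000, 200, 3
X_all = rng.normal(loc=[1.0, -2.0, 0.5], scale=[2.0, 0.5, 1.0], size=(n_all, d))
labeled_idx = rng.choice(n_all, size=n_lab, replace=False)

# 1. Standardize with means and standard deviations from the full sample,
#    then apply the same transform to the labeled rows.
mu = X_all.mean(axis=0)
sd = X_all.std(axis=0, ddof=1)
Z_all = (X_all - mu) / sd
Z_lab = Z_all[labeled_idx]

# 2. Estimate the covariance of the standardized predictors from all rows,
#    not only the labeled ones.
Sigma_all = np.cov(Z_all, rowvar=False)

# 3. Post-stratification weights: share of each stratum in the full sample
#    divided by its share among the labeled rows.
#    The stratum here (first column above its mean) is purely illustrative.
strata_all = (X_all[:, 0] > mu[0]).astype(int)
strata_lab = strata_all[labeled_idx]
pop_share = np.bincount(strata_all, minlength=2) / n_all
lab_share = np.bincount(strata_lab, minlength=2) / n_lab
weights = pop_share[strata_lab] / lab_share[strata_lab]
```

`Z_lab` and `weights` would then be passed to the model as data. One possible use of `weights` is to scale each labeled row’s log-likelihood contribution, though I don’t know whether that is appropriate in a Bayesian model.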

I’m also wondering whether pseudo-labeling is a thing in the continuous-y, Bayesian world…

My goal is prediction, not inference, but I need the betas to make sense.

I would appreciate any links to good literature on the topic, especially if it is Bayesian and thus usable in Stan.
