I hope this isn’t too off-topic.
I am regressing a continuous y on a multi-dimensional X, and my sample of X is much bigger than the subset for which I have y labeled. I’m wondering what approaches exist for making use of the extra data. Unfortunately, I was unable to formulate a Google query that helps, as all I keep finding is about imputation for missing X (not my problem).
I’m pretty sure at least a few things can help me:
- Using the bigger sample to standardize X (already done)
- A better covariance matrix (not 100% sure how best to use it)
- Perhaps corrections for sampling error if my y are picked non-uniformly (they’re indeed kind of stratified)
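
To make the first two items concrete, here is a minimal sketch of what I mean, in plain Python rather than Stan. The data and the names `X_all`, `X_lab`, `column_stats`, `standardize`, and `covariance` are all made up for illustration: the scaling constants and the covariance matrix are estimated from every row of X, and the same scaling is then applied to the labeled subset.

```python
import random
from statistics import fmean, pstdev

def column_stats(rows):
    """Per-column mean and (population) standard deviation."""
    cols = list(zip(*rows))
    return [fmean(c) for c in cols], [pstdev(c) for c in cols]

def standardize(rows, means, sds):
    """Center and scale each row using the supplied column statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def covariance(rows):
    """Sample covariance matrix of the columns (divides by n - 1)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [fmean(c) for c in cols]
    return [[sum((a - ma) * (b - mb) for a, b in zip(ca, cb)) / (n - 1)
             for cb, mb in zip(cols, means)]
            for ca, ma in zip(cols, means)]

random.seed(0)
# Hypothetical data: 2000 rows of X, of which only the first 50 have y.
X_all = [[random.gauss(5, 2), random.gauss(-1, 0.5)] for _ in range(2000)]
X_lab = X_all[:50]

# Scaling constants come from ALL rows, labeled or not.
means, sds = column_stats(X_all)
Z_all = standardize(X_all, means, sds)
Z_lab = standardize(X_lab, means, sds)  # same scaling for the labeled rows

# Covariance of the standardized predictors, estimated from the full sample.
S_all = covariance(Z_all)
```

`Z_lab` is what would go into the regression, and `S_all` is the covariance estimate I am not sure how to exploit.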
I’m wondering if pseudo-labeling is a thing in the continuous-y, Bayesian world…
My goal is prediction, not inference, but I need the betas to make sense.
I would appreciate any links to good literature on the topic, especially if it is Bayesian and thus usable in Stan.