Using bigger sample for X than I have ys

valyagolev · August 12, 2025, 1:39pm

I hope this isn’t too off-topic.

I am regressing a continuous y on a multi-dimensional X, and I have far more observations of X than I have labeled y values. I’m wondering what approaches exist for making use of the extra X data. Unfortunately, I couldn’t come up with a Google query that helps: all I keep finding is about imputation for missing X (not my problem).

I’m pretty sure at least a few things can help me:

Using bigger samples to standardize X (already done)
Better covariance matrix (not 100% sure how to use it best)
Perhaps corrections for sampling error if my labeled y were picked non-uniformly (they are indeed roughly stratified)
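
To make these three items concrete, here is a rough sketch in Python with NumPy. Everything in it is an assumption for illustration: the simulated data, the two-level stratum defined from the first column of X, and all variable names. It only shows where the larger X sample enters (standardization, covariance estimate, stratum weights); it is not meant as a recommended estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pool: 5000 rows of X, of which only 200 will carry a label y.
n_all, d = 5000, 3
scales = np.array([1.0, 5.0, 0.2])
shifts = np.array([0.0, 10.0, -3.0])
X_all = rng.normal(size=(n_all, d)) * scales + shifts

# Hypothetical stratum: sign of the first column. Labels oversample stratum 1.
strata_all = (X_all[:, 0] > 0).astype(int)
idx0 = np.flatnonzero(strata_all == 0)
idx1 = np.flatnonzero(strata_all == 1)
lab_idx = np.concatenate([
    rng.choice(idx0, size=50, replace=False),
    rng.choice(idx1, size=150, replace=False),
])
X_lab = X_all[lab_idx]
strata_lab = strata_all[lab_idx]

# 1) Standardize with mean and sd from the full pool, not just the labeled rows.
mu = X_all.mean(axis=0)
sd = X_all.std(axis=0, ddof=1)
Z_all = (X_all - mu) / sd
Z_lab = (X_lab - mu) / sd

# 2) Covariance of the predictors, estimated from every available row.
Sigma_all = np.cov(Z_all, rowvar=False)

# 3) Post-stratification weights: pool share of a stratum over its labeled share.
pool_share = np.bincount(strata_all, minlength=2) / n_all
lab_share = np.bincount(strata_lab, minlength=2) / len(lab_idx)
w_lab = pool_share[strata_lab] / lab_share[strata_lab]
```

Here `Z_lab` and `w_lab` would be what goes into the regression as data (for example as weights on the log-likelihood terms), while `Sigma_all` is the piece I am unsure how to use.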

I’m wondering if pseudo-labeling is a thing in the continuous-y, Bayesian world…

My goal is prediction, not inference, but I need the betas to make sense.

I would appreciate any links to good literature on the topic, especially if it is Bayesian and thus usable in Stan.
