I have some messy data that I'm modeling: a continuous outcome, a continuous predictor, and a categorical grouping variable. My real model will have other predictors, but for this demonstration I include only these two predictors and the weights, because they capture my problem.
The goal of this analysis is to make inferences about the group variable and to visualize the relationships.
My two questions are about weights and a constraint:
- The outcome needs to be weighted. This can be handled with lm() or lmer() (from lme4) in R, but how can weights be included in the Stan multilevel code below? If I had a weights vector, would I just change y[i] ~ normal(mu, sigma); to y[i] ~ normal(mu, sigma/weights[i]);?
- The outcome, y, is bounded by the x predictor: y >= -x. Not accounting for this constraint produces implausible predictions. You can see this if you run the R code below: because the lm() model does not account for the constraint, its predictions fall outside the bound (see the 2nd R plot below). I haven't yet found a way to model the bound in a non-Bayesian way with lm() or lmer(). Is this a case for gradient descent regression with a custom loss function (roughly what I mean by that is sketched with optim() after the prediction code below)? Any ideas? Since I haven't figured out a non-Bayesian approach to handle the constraint, I'm here to ask whether the constraint y >= -x can somehow be incorporated into the Stan multilevel model code below. A rough sketch of what I have in mind for both questions follows this list.
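To make both questions concrete, here is the kind of change to the likelihood I have in mind (assuming a weights vector were added to the data block). I don't know whether either piece is legitimate Stan practice, which is part of what I'm asking:
// sketch only -- this is what I am asking about, not something I know to be correct
for(i in 1:N){
  mu = intercept + intercept_x2[ GROUP_ID[i] ] + beta*x[i];
  // weights: scale sigma by 1/sqrt(weights[i]), mirroring lm()'s inverse-variance weights?
  // bound:   truncate the normal below at -x[i]?
  y[i] ~ normal(mu, sigma / sqrt(weights[i])) T[-x[i], ];
}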
HERE IS R CODE that uses lm() just for comparison
set.seed(1)  # make the simulated data reproducible
x = rep(seq(-10, 10, 1), each = 5)
y = rep(0, length(x))
weights = sample(seq(1, 1000, 1), length(x))
groups = rep(letters[1:5], times = length(x)/5)
## the outcome y has the relationship y >= -x
for(i in 1:length(x)){
  y[i] = sample( seq(-x[i],
                     -x[i] + ifelse(x[i] < 0,
                                    sample(seq(1, 10, .1), 1),
                                    sample(seq(5, 40, .1), 1)),
                     1),
                 1)
}
dat = data.frame(x = x, y = y, weights = weights, group = factor(groups))
model = lm(y ~ x * group, data = dat, weights = weights)
dat$FIT = predict(model, newdata = dat)
library(ggplot2)
## you can see the fit of the model for each group
ggplot(dat, aes(x = x, y = y, color = group)) +
  geom_point(aes(size = weights)) +
  geom_line(aes(y = FIT))
# data for new predictions
newdata = data.frame(x = seq(-100, 100, 1),
                     group = "c",
                     bound = -seq(-100, 100, 1))
## when predicting new data the outcome y should satisfy y >= -x,
## but the lm() predictions fall below the bound: the fitted (red) line drops under the green bound line to the left of zero
newdata$new_preds = predict(model, newdata = newdata)
ggplot(newdata, aes(x = x)) +
  geom_line(aes(y = new_preds), color = "red") +
  geom_line(aes(y = bound), color = "green")
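For reference, this is roughly what I mean by "gradient descent regression with a custom loss function": minimize the weighted squared error plus a penalty whenever a fitted value drops below -x. The penalty term and the optim() call are my own untested sketch to illustrate the idea, not an established method.
## custom-loss sketch: weighted least squares plus a penalty for fitted values below -x
## (my own untested idea, included only to illustrate the question)
X = model.matrix(~ x * group, data = dat)         # same design as the lm() fit
loss = function(b, penalty = 1e6){                # penalty = 1e6 is an arbitrary large value
  fitted = as.vector(X %*% b)
  sum(dat$weights * (dat$y - fitted)^2) +
    penalty * sum(pmax(-dat$x - fitted, 0)^2)     # penalize any fitted value below -x
}
opt = optim(par = rep(0, ncol(X)), fn = loss, method = "BFGS")
opt$par                                           # coefficients under the penalized loss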
STAN CODE WITH MULTILEVEL MODEL (weights and the constraint are not yet built in)
# below is a multilevel model using Stan; can weights and the constraint be built in to avoid the erroneous predictions?
library(rstan)
head(dat)
# create a numeric vector to indicate the categorical groups
dat$GROUP_ID = as.integer(dat$group)
standat <- list(
  N = nrow(dat),
  y = dat$y,
  x = dat$x,
  GROUP_ID = dat$GROUP_ID,
  nGROUPS = nlevels(dat$group)
)
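If the weights do end up in the model, I assume the data list would also need to carry them, with a matching declaration in the Stan data block; something along these lines (again, part of what I'm asking):
## presumably the weights would also need to be passed in:
standat$weights = dat$weights
## ...with a matching declaration in the data block, e.g.:  vector<lower=0>[N] weights;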
stanmodelcode = '
data {
  int<lower=1> N;
  int nGROUPS;
  real y[N];
  real x[N];
  int<lower=1, upper=nGROUPS> GROUP_ID[N];
}
transformed data {
}
parameters {
  real intercept;
  vector[nGROUPS] intercept_x2;   // group-level intercept offsets
  real beta;
  real<lower=0> sigma;
  real<lower=0> sigma_2;          // sd of the group offsets
}
transformed parameters { // none needed
}
model {
  real mu;
  // priors
  intercept ~ normal(0, 10);
  intercept_x2 ~ normal(0, sigma_2);
  beta ~ normal(0, 10);
  sigma ~ normal(0, 10);
  sigma_2 ~ normal(0, 10);
  // likelihood
  for(i in 1:N){
    mu = intercept + intercept_x2[ GROUP_ID[i] ] + beta*x[i];
    y[i] ~ normal(mu, sigma);
  }
}
'
fit22 = stan(model_code=stanmodelcode, data=standat, iter=5000, warmup=500, chains = 1)
fit22