How to add weights and constrains to Multilevel model

fitting-issues
specification

#1

I have some messy data that I’m modeling. There is a continuous outcome, continuous predictor and a categorical variable. In my real model I will have other predictors but for this demonstration I only include 2 predictors and the weights because they capture my problem.

The goal of this analysis is to make inferences on the group variable and allow for visualization of the relationships.
My two questions are about weights and a constraint:

  1. The outcome needs to be weighted. This can be handled in lm() or lme4() in R but how can weights be included in the stan multilevel code below ? If I had a weights vector, would I just change y[i] ~ normal(mu, sigma); to be y[i] ~ normal(mu, sigma/weights[i]);?

  2. The outcome, y, is bounded by the x predictor. Y has the relationship to x: y >= -1x. Not accounting for this constraint causes erroneous implausible predictions. You can see this if you run the R code below. It shows when predictions are made using the lm() model because the constraint is not accounted for the predictions are outside the bound. See the 2nd r plot below. I haven’t yet found a way to model the bound in a non-bayesian way using lm() or lme4(). Is this a case for gradient descent regression with a custom loss function? Any ideas? Given that I haven’t figure out a non bayesian approach to handle the constraint, I’m here because I’m wondering if that constraint, y >= - 1x, be incorporated to the stan multilevel model code below somehow?

HERE IS R CODE that uses lm() just for comparison

    x =   rep(seq(-10,10,1),each=5) 
    y =  rep(0,length(x) )
    weights = sample( seq(1,1000,1) ,length(x))
    groups = rep( letters[1:5], times =length(x)/5  ) 

## the outcome y has the relationship y >= -x
for(i in 1:length(x)){
  y[i] = sample(  seq(-x[i],-x[i]+ ifelse(x[i]<0 , 
                                          sample(seq(1,10,.1),1),
                                          sample(seq(5,40,.1),1) ),1)
                                          
                      ,1)
} 
 
dat =data.frame(x = x, y = y, weights = weights, group = groups)
model = lm(dat = dat, y ~ x + group + group * x, weights = weights)
dat$FIT= predict(model, dat, allow.new.levels = TRUE)

library(ggplot2)

## you can see the fit of the model for each group
ggplot(data = dat, aes(x = x, y = y, color = group))+geom_point( aes(size = weights))+
geom_line(data = dat, aes(x = x, y = FIT, color = group))

#data for new predictions
newdata = data.frame( x = seq(-100,100,1) , 
                      group = rep("c", length( seq(-100,100,1) )),
                      bound = -seq(-100,100,1)
                      )

## when predicting new data the outcome y should have the  relationship y >= -x
## but the predicted values using lm() are below the bound. See how red line is below blue to the left of zero
new_preds  = predict(model, newdata, allow.new.levels = TRUE)
ggplot(data = newdata, aes(x = x, y = new_preds, color = group))+geom_line()  +
  geom_line(data = newdata, aes(x = x, y = bound, color = "green"))  

STAN CODE WITH MULTILEVEL MODEL but weights and constraints are not built in
# below is a mulitlevel model using stan but can weights and the constrain be built in to avoid erroneous predictions

library(rstan)
head(dat)
# create a numeric vector to indicate the categorical groups
dat$GROUP_ID  = match(  dat$group, levels( dat$group   )  ) 
library(rstan)
standat <- list(
  N = nrow(dat),
  y = dat$y,
  x = dat$x,
  GROUP_ID = dat$GROUP_ID,
  nGROUPS = 5
)


stanmodelcode = '
data {
int<lower=1> N; 
int nGROUPS;
real y[N];                                 
real x[N];   
int<lower=1, upper=nGROUPS> GROUP_ID[N];
}


transformed data{

}

parameters {
real intercept;  
vector[nGROUPS] intercept_x2;
real beta;                   
real<lower=0> sigma;    
real<lower=0>  sigma_2;
}

transformed parameters {  // none needed
}

model {
real mu;

// priors
intercept~ normal(0,10);
intercept_x2 ~ normal(0,sigma_2);
beta ~ normal(0,10);
sigma ~ normal(0,10);
sigma_2 ~ normal(0,10);

// likelihood

for(i in 1:N){
mu = intercept + intercept_x2[ GROUP_ID[i] ] +  beta*x[i];
y[i] ~ normal(mu, sigma);
}

}
'

fit22 = stan(model_code=stanmodelcode, data=standat, iter=5000, warmup=500, chains = 1)
fit22

#2

Your first plot shows that the data for x is between -10 and 10, your second plot goes from -100 to 100. It looks like the condition might hold for x in [-10, 10]. You’re probably asking too much of a linear model to extrapolate outside the range of your data.

The weights can be incorporated as

for(i in 1:N){
  mu = intercept + intercept_x2[ GROUP_ID[i] ] +  beta*x[i];
  target =+ normal_lpdf(y[i] | mu, sigma) * weights[i] ;
}

See also Survey weighted regression

Have a look at the truncated or censored data section in the Stan Manual on how to implement the constraint.


#3

This is just toy data for an example. The real data is unbounded in x.Can you provide the ink the stan manual? It’s surprisingly difficult to locate.


#4

http://mc-stan.org/users/documentation/index.html

It’s the first link.

It does not really matter that this is toy data. In general, you have to make very strong assumptions to extrapolate beyond the data. Furthermore, if you have a linear constraint between x and y and you estimate a linear model between x and y, they need to have the exact same slopes for them to never cross (parallel lines).


#5

I’vs also seen normal(mu, sigma/weights[i])