Modeling Missing Data

I have a 2-step generative model.

  1. vector i ~ normal(mu, std)
  2. I algebraically transform each of the entries of vector i onto a scale of 0-1 using a sigmoid function.
  3. Then I draw a new variable vector y ~ Bernoulli( sigmoid(i) )
  4. The observed data, vector s, is equal to: { i if y=1 } and { 0 if y=0 }

I hope to draw samples for the parameters mu and std, given I observe the vector of data s. I have coded the model as follows, but it is giving me a lot of divergences. (Even though I am following non-centered (centered?) implementation.) I am also not sure if I have coded my generative process correctly.


"""
functions {
    real sigm(real x){
        return (1 / (1 + exp(-0.92*(x-8.24)));
    }
    
    real signal_lpdf(real s, real i, real mu, real std){ 
        if (s==0) {
            return bernoulli_lpmf(0 | sigm(i)) + normal_lpdf((i-mu)/std | 0, 1);
            } 
        else {
            return bernoulli_lpmf(1 | sigm(s))+ normal_lpdf((s-mu)/std | 0, 1);
            }
    }
}

data {
     int<lower=0> nPeps;       // number of theoretical peptides in protein
     real<lower=0> s[nPeps];   // array holding signal strength for that protein (0 is missing value)
}

parameters {
    real mu;
    real i;
    real<lower=0> std;
}

transformed parameters {
real mu2;
real<lower=0> std2;
mu2 = mu*10+7.15;
std2 = std*10+3;
}

model { 
    mu ~ std_normal(); 
    std ~ std_normal();
    for (j in 1:nPeps) {
        s[j] ~ signal(i, mu2, std2);
    }
}
"""
1 Like

HMC doesn’t like discontinuity ; so I reckon your IF statement in the signal function is part of your problem.

You can use an inverse logit to make the transition between your two densities continuous. This thread uses the trick:

3 Likes

Oooh that looks fun. Thanks for sharing!

1 Like