Fitting a discrete integer mixture model within the Stan paradigm

Hi all,

I’m a new Stan user learning the paradigm for expressing models as Stan programs.

Here’s one scenario I’m struggling with: a “privacy algorithm” example from Cameron Davidson-Pilon’s “Probabilistic Programming & Bayesian Methods for Hackers” book. I’m trying to translate this pymc3 example into Stan but it doesn’t seem to fit well within the paradigm.

In this scenario, researchers ask each student whether they cheated on an exam. To respond, each student flips a coin:

  • If heads, they reply truthfully whether they cheated.
  • If tails, they flip the coin again, and reply yes if heads, no if tails.

Given the number of students and the observed proportion of “yes” answers, I want to estimate the true frequency of cheating.
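
For concreteness, here is how I would simulate this data-generating process with numpy (the true rate of 0.2 is just an illustrative value I picked):

import numpy as np

np.random.seed(42)
N = 100
p_true = 0.2                               # illustrative true cheating rate
truths = np.random.binomial(1, p_true, N)  # true cheating status
first = np.random.binomial(1, 0.5, N)      # first coin flip (1 = heads)
second = np.random.binomial(1, 0.5, N)     # second coin flip (1 = heads)
# heads on first flip -> answer truthfully; tails -> answer by second flip
answers = first * truths + (1 - first) * second
pobs = answers.mean()                      # observed proportion of "yes" answers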

Below is how I am approaching this with pystan and the error I am getting: “attempt to assign variable in wrong block. left-hand-side variable origin=data”.

Could you please help me understand how to write such a model correctly in Stan? How do I model the computation of pobs, perhaps as a generated quantity? My confusion is twofold:

  • Perhaps I should move the pobs declaration out of data and into model, but then I’m not sure how to pass in the observed proportion of “yes” answers.
  • How do I properly model discrete integer quantities, in this case the mixture sum of the students’ coin flips?

The Python code and the error follow.

Thanks!

flip_code = """
functions {}
data {
    int<lower=0> N; // number of students
    real pobs; // observed proportion of "yes" answers
    //int x; // observed number of students who reply yes 
    // do we need to pass in observed value for each student, i.e. an int observed[N]?
}
transformed data {} // only evaluated once
parameters {
    real<lower=0,upper=1> p; // true probability of cheating
}
transformed parameters { // executed once per iteration
} 
model {
    // for each student:
    //int<lower=0,upper=1> truths[N]; // can't have variable constraints in model?
    int truths[N]; // true cheating status
    int first_flips[N]; // first coin flip result
    int second_flips[N]; // second coin flip result
    real val[N]; // observed value. real so that can do floating point division below.
    
    p ~ uniform(0,1); // prior on true probability of cheating
    for (i in 1:N) {
        truths[i] ~ bernoulli(p);
        first_flips[i] ~ bernoulli(0.5); // fair coin flip
        second_flips[i] ~ bernoulli(0.5); // fair coin flip
        val[i] = first_flips[i] * truths[i] + (1 - first_flips[i])*second_flips[i]; // observed value from privacy algorithm
    }
    pobs = sum(val) / N; // note that we force floating point division here
    //x = sum(val);
}
generated quantities {}
"""

import pystan

N = 100
x = 35
pobs = x / N
flip_data = {'N': N, 'pobs': pobs}

flip_fit = pystan.stan(model_code=flip_code, data=flip_data,
                  iter=1000, chains=4)

Output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-42-beba03c597ad> in <module>()
     42 
     43 flip_fit = pystan.stan(model_code=flip_code, data=flip_data,
---> 44                   iter=1000, chains=4)

/home/maxim/anaconda2/envs/infil/lib/python3.6/site-packages/pystan/api.py in stan(file, model_name, model_code, fit, data, pars, chains, iter, warmup, thin, init, seed, algorithm, control, sample_file, diagnostic_file, verbose, boost_lib, eigen_lib, n_jobs, **kwargs)
    377         m = StanModel(file=file, model_name=model_name, model_code=model_code,
    378                       boost_lib=boost_lib, eigen_lib=eigen_lib,
--> 379                       obfuscate_model_name=obfuscate_model_name, verbose=verbose)
    380     # check that arguments in kwargs are valid
    381     valid_args = {"chain_id", "init_r", "test_grad", "append_samples", "enable_random_init",

/home/maxim/anaconda2/envs/infil/lib/python3.6/site-packages/pystan/model.py in __init__(self, file, charset, model_name, model_code, stanc_ret, boost_lib, eigen_lib, verbose, obfuscate_model_name, extra_compile_args)
    209                                          model_name=model_name,
    210                                          verbose=verbose,
--> 211                                          obfuscate_model_name=obfuscate_model_name)
    212 
    213         if not isinstance(stanc_ret, dict):

/home/maxim/anaconda2/envs/infil/lib/python3.6/site-packages/pystan/api.py in stanc(file, charset, model_code, model_name, verbose, obfuscate_model_name)
    127     if result['status'] == -1:  # EXCEPTION_RC is -1
    128         error_msg = "Failed to parse Stan model '{}'. Error message:\n{}".format(model_name, result['msg'])
--> 129         raise ValueError(error_msg)
    130     elif result['status'] == 0:  # SUCCESS_RC is 0
    131         logger.debug("Successfully parsed Stan model '{}'.".format(model_name))

ValueError: Failed to parse Stan model 'anon_model_9db1e01b1721a4e4e70a05b89eeea5d9'. Error message:
SYNTAX ERROR, MESSAGE(S) FROM PARSER:

Warning (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If it does, you need to include a target += statement with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
    truths[i] ~ bernoulli(...)
Warning (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If it does, you need to include a target += statement with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
    first_flips[i] ~ bernoulli(...)
Warning (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If it does, you need to include a target += statement with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
    second_flips[i] ~ bernoulli(...)
attempt to assign variable in wrong block. left-hand-side variable origin=data

ERROR at line 30

 28:            val[i] = first_flips[i] * truths[i] + (1 - first_flips[i])*second_flips[i];
 29:        }
 30:        pobs = sum(val) / N; // note that want to force floating point division here
                  ^
 31:        //x = sum(val);

PARSER EXPECTED: <expression assignable to left-hand side>

You need to write out the posterior density, which includes a two-component mixture. Then check out the mixture chapter in the manual to see how to marginalize out the discrete variables. In short, something like this:

Z1 is the first coin flip, Z2 the second, and p the true cheating rate.
P[yes] = P[Z1=heads] * P[yes|Z1=heads] + (1 - P[Z1=heads]) * P[yes|Z1=tails]
P[yes|Z1=heads] = p
P[yes|Z1=tails] = P[Z2=heads] = 1/2
so P[yes] = (1/2) * p + 1/4

Then add priors and compute the posterior as usual. When you do this over many observations, the informative part of the likelihood is scaled by P[Z1=heads], so the smaller that probability, the flatter the posterior (and therefore fewer truthful answers means less information about the true rate). I remember this from a stats class way back; I think Michael Lavine made us write our own samplers for it. :)
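
For what it’s worth, a minimal Stan sketch of that marginalized model; I’m assuming you pass in the count of “yes” answers as num_yes rather than the proportion pobs:

data {
    int<lower=0> N;                 // number of students
    int<lower=0, upper=N> num_yes;  // observed number of "yes" answers
}
parameters {
    real<lower=0, upper=1> p;       // true cheating rate
}
model {
    // marginalized observation probability from the derivation above:
    // P[yes] = 0.5 * p + 0.5 * 0.5 = 0.5 * p + 0.25
    // (the uniform prior on p is implicit in its declared bounds)
    num_yes ~ binomial(N, 0.5 * p + 0.25);
}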

I thought this was an interesting problem, so I decided to take a crack at it. Hope this is helpful.

I’d be curious to know: is this how others would approach the problem? It’s a little different from other approaches I’ve seen in that it handles the mixture in transformed parameters rather than in the model block.

Stan code

data {
    int<lower=0> N; // number of students
    real pobs; // observed proportion of "yes" answers
    int<lower=0,upper=N> num_yes; // observed number of yes answers
}
transformed data { 
    real<lower=0,upper=1> flip1; // prob first coin flip (fair)
    real<lower=0,upper=1> flip2; // prob second coin flip (fair)
    real<lower=0> alpha_flip1;   // reparameterization for flip1
    real<lower=0> beta_flip1;       
    flip1 = 0.5;
    flip2 = 0.5;
    alpha_flip1 = N * flip1;
    beta_flip1 = N * (1 - flip1);
}
parameters {
    real<lower=0, upper=1> p_cheating; // true rate of cheating ("yes" answer)
    real<lower=0, upper=1> prop_flip1; // observed prop of first-flips that are success -> answer truth 
    real<lower=0, upper=1> prop_flip2; // observed prop of second-flips that are success -> answer "yes"
}
// reparameterize beta distribution
// as in section 20.2 of stan manual (v 2.14.0)
transformed parameters {
    real<lower=0, upper=N> lambda2;   // number of flip2 trials
    real<lower=0> alpha_flip2;        // number of 2nd flip successes
    real<lower=0> beta_flip2;         // number of 2nd flip failures
    real<lower=0, upper=1> theta;     // parameter for overall rate of yes (analogous to pobs)
    
    // number of second-flip trials, based on prop observed for 1st flip
    lambda2 = N * (1 - prop_flip1);    
    
    // parameters for second flip
    alpha_flip2 = lambda2 * flip2;     // second answer based on flip2 result
    beta_flip2 = lambda2 * (1 - flip2);
    
    // compute theta as weighted average
    theta = prop_flip1*p_cheating + (1 - prop_flip1)*prop_flip2;
}
model {
    // model probability of observing heads / tails on each flip
    // given number of trials
    prop_flip1 ~ beta(alpha_flip1, beta_flip1);
    prop_flip2 ~ beta(alpha_flip2, beta_flip2);
    
    // prior on cheating rate
    // (not necessary; included here for completeness)
    p_cheating ~ uniform(0, 1);
    
    // number of observed "yes" responses
    num_yes ~ binomial(N, theta);
}

Fitting to data

Using pystan:

import pystan

N = 100
x = 35
pobs = x / N
flip_data = {'N': N, 'pobs': pobs, 'num_yes': x}

flip_fit3 = pystan.stan(file='flip_model-beta.stan',
                        data=flip_data,
                        iter=2000,
                        chains=4)
flip_fit3

Output:

Inference for Stan model: anon_model_454584dc776870d11b84a4445bd48eb0.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

              mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
p_cheating    0.21  2.5e-3   0.11   0.02   0.13   0.21   0.28   0.44   1864    1.0
prop_flip1     0.5  1.0e-3   0.05    0.4   0.46    0.5   0.53   0.59   2100    1.0
prop_flip2     0.5  1.5e-3   0.07   0.36   0.45    0.5   0.54   0.62   2059    1.0
lambda2      50.37     0.1   4.77  40.59  47.29   50.4  53.55  59.54   2100    1.0
alpha_flip2  25.18    0.05   2.38   20.3  23.65   25.2  26.77  29.77   2100    1.0
beta_flip2   25.18    0.05   2.38   20.3  23.65   25.2  26.77  29.77   2100    1.0
theta         0.36  7.0e-4   0.04   0.28   0.32   0.35   0.38   0.45   4000    1.0
lp__        -137.0    0.04   1.32 -140.5 -137.6 -136.7 -136.1 -135.5   1378    1.0

Samples were drawn using NUTS at Tue Apr 11 20:18:48 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

@sakrejda and @jackinovik: Thank you both for your replies!

My further testing of @jackinovik’s approach with simulated data confirms it works. Super helpful. (As a sanity check, inverting the derivation above gives p ≈ 2 * (pobs - 0.25) = 0.2 for pobs = 0.35, right in line with the posterior mean of 0.21.)

I second @jackinovik’s question: does this modeling approach follow the best practices of the Stan paradigm? My goal in starting this thread is not just to resolve this specific question, but to improve my understanding of the “proper” way to use Stan.

Thanks again.

We tend not to fit these kinds of discrete models, so Stan’s not well tailored to expressing them. Honestly, I don’t know where people are getting them from!

As you saw in the first failed program, we don’t allow discrete parameters.

Jacki’s approach is definitely the way to go for this simple problem.

If you wind up getting long sequences of flips with latent states, you will need to code a general state-space model like the HMM example in the manual—otherwise, the marginalization will be intractable.

After the Davidson-Pilon book, you might enjoy McElreath’s Statistical Rethinking if you want a deeper hands-on intro to Bayesian stats. (It’s not free, though, and it hides the Stan programs under an abstraction layer.)


I have to disagree—it’s totally fine for expressing them. He needs to do basically what you do for the mixture models in the manual. Jacki’s approach is really direct, but you could also just code the density using our basic math functions and it wouldn’t be much code.
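
For example, a direct coding might look like this sketch, assuming you record each student’s individual answer in an array y rather than just the total count:

data {
    int<lower=0> N;              // number of students
    int<lower=0, upper=1> y[N];  // each student's answer (1 = "yes")
}
parameters {
    real<lower=0, upper=1> p;    // true cheating rate
}
model {
    for (n in 1:N)
        // marginalize over the unobserved first flip:
        // heads (prob 0.5) -> truthful answer; tails -> purely random answer
        target += log_mix(0.5,
                          bernoulli_lpmf(y[n] | p),
                          bernoulli_lpmf(y[n] | 0.5));
}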

I just meant expressing models like this in Stan is clunky compared to having discrete parameters. It also puts a heavy burden on the user to do the marginalization.

The upside is that you can actually fit the models, whereas a lot of the discrete samplers just get locked up if there’s correlation among the discrete parameters.

I did a lot of work on the Dawid-Skene-type annotation/rating/coding models (there’s an example in the discrete chapter of the manual), and they’re easier to express in BUGS, but BUGS and JAGS take roughly forever to fit them, even with good inits. We had the same experience with the Cormack-Jolly-Seber models for mark-recapture data: Stan fits the marginalized model well in minutes, whereas BUGS takes a long, long time to converge and mixing is poor. It turns out you can do the marginalizations in BUGS, too, as Cole Monnahan showed, and get a huge speedup there as well.

For some models the marginalization is a real pain, but for small problems like this it’s pretty straightforward (most users can handle the sums with a tutorial), and the performance gain, as you say, is amazing. We should be touting this rather than worrying that you can’t state the model in a way that’s basically bonkers anyway. … although I’m sure someday we could come up with syntax for discrete variables that compiles down to the summed version …

That would be awesome!

That would be a challenge within a graphical modeling language and pretty much impossible to do in general for Stan. The problem is that we need to do the marginalizations intelligently by moving them inside scopes to the extent possible. Stan’s general control structures prevent us from doing that in general, whereas a graphical modeling language would give us an algebraic structure to work with (though I imagine multivariates and loops will still be challenging unless they’re coded in the graph structure rather than expanded as they are in BUGS/JAGS).