PARSER EXPECTED: <distribution and parameters>

I am trying to do some deconvolution of tissue.

I have a tissue sample, which is a combination of normal tissue and cancer.
I am modelling it as a mixture of normal distributions, assuming that the methylation profile of cancer and normal tissue follows relatively narrow normal distributions. That is:

M = \alpha N (\mu_{\text{cancer}}, \sigma_{\text{cancer}}) + (1 - \alpha) N (\mu_{\text{normal}}, \sigma_{\text{normal}})

where \mu_{\text{cancer}} and \sigma_{\text{cancer}} are the unknown mean and standard deviation of the cancer methylation profile, and \mu_{\text{normal}} and \sigma_{\text{normal}} are the known parameters of the normal healthy tissue. The mixture proportion \alpha is unknown.

Note that this is different from the example mixture distribution, where each sample belongs to one distribution or the other. Here the mixture is a weighted sum of both.

My STAN code is this:

    data {
      int<lower=1> S; // number of samples
      int<lower=1> P; // number of probes
      matrix[S, P] mixed_tissue; // mixed tissue
      vector[P] mu_lung; // mean of lung, estimated from data
      vector[P] sd_lung; // sd of lung, estimated from data
    }

    parameters {
      real<lower=0, upper=1> alpha; // cancer cell fraction
      vector[P] mu_cancer; // mean of cancer
      vector[P] sd_cancer; // sd of cancer
    }

    model {
      // using a for loop:
      for (s in 1:S) {
        for (p in 1:P) {
          mixed_tissue[s, p] ~ alpha[s] * normal(mu_cancer[p], sd_cancer[p]) + (1 - alpha[s]) * normal(mu_lung[p], sd_lung[p]);
        }
      }
    }

Yet I get an unhelpful error “PARSER EXPECTED: <distribution and parameters>” pointing at mu_lung. I don’t understand what it means: mu_lung is a fixed value, nothing interesting, nothing that needs to be sampled or estimated. In other examples, fixed values are explicitly passed.

Thanks for help.
ps. first time using STAN. Not yet familiar with the target notation.

Was there also something else given with the error msg?

Did you give the data in the correct format?

Full error message:

 error in 'model433c049566996_entex' at line 20, column 33
    18:     for (s in 1:S) {
    19:         for(p in 1:P) {
    20:             mixed_tissue[s, p] ~ alpha[s] * normal(mu_cancer[p], sd_cancer[p]) + (1 - alpha[s]) * normal(mu_lung[p], sd_lung[p]);
    21:             }

PARSER EXPECTED: <distribution and parameters>
Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 

Haven’t thought about the format of the data. I am passing some data.table columns instead of R objects. I converted it into a native R object matrix, but this didn’t solve the issue.

Is the notation correct? I am looking at the documentation and the notation ~ is typically used as:

y ~ distribution(parameters)

not as

y ~ distribution(parameters) + distribution(parameters) + values

(see 1.1 Linear regression, Stan User’s Guide)

For more complex models, the target += notation is used instead, but I am not yet comfortable specifying models this way. Might that be the issue?

I think this is your problem. The ~ syntax is used only when a distribution is on the right, not a complex expression.

You can write your own l[u]pdf function, or replace it with an equivalent target +=.
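To illustrate the two forms, here is a minimal sketch with a single normal likelihood. The names N, y, mu and sigma are made up for the example, and normal_lupdf assumes a reasonably recent Stan (2.25 or later, as far as I know).

```stan
data {
  int<lower=1> N;        // number of observations (hypothetical name)
  vector[N] y;           // observations (hypothetical name)
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  // Sampling-statement form: exactly one distribution on the right-hand side.
  y ~ normal(mu, sigma);

  // Explicit form that does the same thing. Use one or the other, not both,
  // otherwise the likelihood is added to the target twice:
  // target += normal_lupdf(y | mu, sigma);
}
```

The right-hand side of ~ has to be a single distribution name with its arguments, which is why an arithmetic expression combining two normal(...) terms is rejected by the parser.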

Thank you Brian.

This seems like something documentation should mention explicitly.

Could you please suggest how I should rewrite the model using the target += notation? I am still unfamiliar with it, as well as with the f(y|x) notation (I am still not sure whether it is standard conditional probability or some special Stan notation).

A statement like a ~ normal(b, c) is perfectly equivalent to target += normal_lupdf(a | b, c)

So, I think your above statement would be best translated as something like

target += alpha[s] * normal_lupdf(mixed_tissue[s, p] | mu_cancer[p], sd_cancer[p]);
target += (1 - alpha[s]) * normal_lupdf(mixed_tissue[s, p] | mu_lung[p], sd_lung[p]);

This still fails because you’re trying to index into alpha, which is not an array/vector, but if you fix that it compiles (make sure it means what you want, though!)

Thanks Brian

I will certainly fix the alpha, but the way you wrote it is essentially the model mentioned in the chapter:

which reads to me as saying that an individual data point in mixed_tissue can come from cancer tissue with probability alpha OR from normal tissue with probability 1 - alpha, which is not what I want (and is the reason I wrote it with ~ the way I did in the first place).

The issue is that the normal distributions are not conditional on the mixed_tissue data directly, but their sum is (i.e., it is not N(m | mu_1, sd_1) + N(m | mu_2, sd_2) but m = a + b, with N(a | mu_1, sd_1) and N(b | mu_2, sd_2)). At least that is how I understand the | notation.

Am I understanding this correctly?
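For comparison, this is roughly how the "one or the other" mixture is usually written in Stan, using log_mix. It is only a sketch for a single probe, with made-up names (N, y, mu_normal, sd_normal), and it is not the additive model asked about here. Note also that multiplying each log density by its weight and adding both to target is not the same thing: that corresponds to a product of powered densities, not to a weighted sum of densities.

```stan
data {
  int<lower=1> N;              // number of observations (hypothetical name)
  vector[N] y;                 // observed values for one probe (hypothetical name)
  real mu_normal;              // known mean of the normal tissue
  real<lower=0> sd_normal;     // known sd of the normal tissue
}
parameters {
  real<lower=0, upper=1> alpha;  // probability that a point comes from cancer
  real mu_cancer;
  real<lower=0> sd_cancer;
}
model {
  for (n in 1:N) {
    // log( alpha * N(y | cancer) + (1 - alpha) * N(y | normal) )
    target += log_mix(alpha,
                      normal_lpdf(y[n] | mu_cancer, sd_cancer),
                      normal_lpdf(y[n] | mu_normal, sd_normal));
  }
}
```

Here each observation is treated as drawn from exactly one of the two components, which is the interpretation the question is trying to avoid.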


I am not sure how you could express that model (not saying it’s impossible, I just don’t know!)

Welp, I restated the model as Z = aX + bY with X and Y independent, so that Z ~ N(a*mu_x + b*mu_y, sqrt( (a*sd_x)^2 + (b*sd_y)^2 )); see e.g. Sum of normally distributed random variables - Wikipedia

and now it runs, but it runs out of memory. I guess I need to reduce the probes from 850 000 to something a bit smaller. :)

At least the maximum likelihood optimization runs well.
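For anyone finding this later, a sketch of what the restated model could look like in Stan. It assumes the cancer and normal components are independent, that alpha is a single fraction shared by all samples, and it leaves out priors entirely. The data names follow the first post; mu_mix and sd_mix are names made up for this sketch.

```stan
data {
  int<lower=1> S;               // number of samples
  int<lower=1> P;               // number of probes
  matrix[S, P] mixed_tissue;    // mixed tissue
  vector[P] mu_lung;            // known mean of lung
  vector<lower=0>[P] sd_lung;   // known sd of lung
}
parameters {
  real<lower=0, upper=1> alpha;   // cancer cell fraction (assumed shared by all samples)
  vector[P] mu_cancer;            // mean of cancer
  vector<lower=0>[P] sd_cancer;   // sd of cancer
}
model {
  // Mean and sd of alpha * X_cancer + (1 - alpha) * X_lung for independent normals
  vector[P] mu_mix = alpha * mu_cancer + (1 - alpha) * mu_lung;
  vector[P] sd_mix = sqrt(square(alpha * sd_cancer) + square((1 - alpha) * sd_lung));

  for (s in 1:S) {
    // Row s of the matrix is a row_vector; transpose it to match the vectors above
    mixed_tissue[s]' ~ normal(mu_mix, sd_mix);
  }
}
```

The right-hand side of ~ is now a single normal distribution, so the parser accepts it, and vectorizing over probes avoids the inner loop.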

Thanks Brian
– Jirka
