# Latent integer vector

Hi,

I am a former BUGS user. I have a moderately sized dataset (650,000 two-dimensional samples), and I aim to fit the distribution of its log ratio log(x/y). The size of the data makes it infeasible to simulate using BUGS.

By inspection, the distribution of the ratio is roughly 7% zeros, while the remaining 93% follows a double exponential distribution with mean zero. The state is latent, so I aim to model the zero proportion and the scale of the double exponential distribution. The model is very simple. The only difficulty I have is that Stan does not support integer parameters, so `int<lower=0, upper=1> x[N];` gives me an error.

I have read the chapter on latent discrete parameters, but it does not help much.

```stan
data {
  int<lower=0> N;
  vector[N] y;
}

parameters {
  real<lower=0, upper=0.3> sigma1;
  real<lower=0, upper=1> theta;
  int<lower=0, upper=1> x[N];  // not valid: Stan parameters must be continuous
  vector[N] mark;
}

model {
  x ~ bernoulli(theta);
  mark ~ double_exponential(0, sigma1);
  y = x .* mark;  // also not valid: data cannot be assigned to in the model block
}
```


I managed to work around it with a mixture model that uses a uniform distribution on a tight domain (-0.000005, 0.000005), but I guess it is overkill.

```stan
data {
  int<lower=0> N;
  vector[N] y;
}

parameters {
  real<lower=0, upper=0.3> sigma1;
  real<lower=0, upper=1> theta;
}

model {
  for (n in 1:N)
    target += log_mix(theta,
                      double_exponential_lpdf(y[n] | 0, sigma1),
                      uniform_lpdf(y[n] | -0.005, 0.005));
}
```
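For intuition (this is an illustrative aside, not part of the thread's Stan code): Stan's `log_mix(theta, a, b)` is just `log_sum_exp(log(theta) + a, log1m(theta) + b)`. A minimal NumPy/SciPy sketch of one iteration of the loop above, with made-up values for `theta`, `sigma1`, and a single observation:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import laplace, uniform

# Hypothetical parameter values, purely for illustration
theta, sigma1 = 0.93, 0.1
y_n = 0.0  # one observation that is exactly zero

# Stan's log_mix(theta, a, b) == log_sum_exp(log(theta) + a, log1m(theta) + b)
lp = logsumexp([
    np.log(theta) + laplace.logpdf(y_n, loc=0.0, scale=sigma1),
    np.log1p(-theta) + uniform.logpdf(y_n, loc=-0.005, scale=0.01),  # U(-0.005, 0.005)
])

# The same quantity computed directly on the probability scale
direct = np.log(theta * laplace.pdf(y_n, 0.0, sigma1)
                + (1 - theta) * uniform.pdf(y_n, loc=-0.005, scale=0.01))
```

Working on the log scale with `logsumexp` avoids underflow when the component densities are tiny, which is exactly why Stan provides `log_mix`.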


I think the thing you want here is the mixture model.

So if you have:

p(x) = \theta p_1(x) + (1 - \theta) p_2(x)

And say p_1 is just the normal density and p_2 is defined as

p_2(x = 0) = 1 and p_2(x \ne 0) = 0

Then:

p(x = 0) = \theta p_1(0) + (1 - \theta) and
p(x \ne 0) = \theta p_1(x)

So you can write that in Stan with:

```stan
for (n in 1:N) {
  if (y[n] == 0) {
    target += log_sum_exp(log(theta) + normal_lpdf(0 | 0, sigma1),
                          log1m(theta));
  } else {
    target += log(theta) + normal_lpdf(y[n] | 0, sigma1);
  }
}
```


Does that make sense?


Hi bbbales2,

Thanks for your help! Your code indeed compiles, but it gives a result of theta = 0, and I am not sure why. However, it seems that you also think I should fit a mixture model, so I will stick with my ‘overkill’:

```stan
data {
  int<lower=0> N;
  vector[N] y;
}

parameters {
  real<lower=0, upper=0.3> sigma1;
  real<lower=0, upper=1> theta;
}

model {
  for (n in 1:N)
    target += log_mix(theta,
                      double_exponential_lpdf(y[n] | 0, sigma1),
                      uniform_lpdf(y[n] | -0.0005, 0.0005));
}
```


The idea here is that, in the dataset, the only samples between -0.0005 and 0.0005 are exactly zero.

Cheers,

Chen


Hah, I guess I’m confusing pdfs and pmfs, go me. Good to hear that what you have works, though! I don’t see an obvious way to make it better.

What I really want to do is a correct version of the following. But I am pretty sure it does not make sense to sum a pdf with a pmf?

```stan
model {
  for (n in 1:N) {
    if (abs(y[n]) < 1e-5) {
      target += log_sum_exp(bernoulli_lpmf(1 | theta),
                            bernoulli_lpmf(0 | theta)
                              + double_exponential_lpdf(y[n] | 0, sigma1));
    } else {
      target += bernoulli_lpmf(0 | theta)
                + double_exponential_lpdf(y[n] | 0, sigma1);
    }
  }
}
```


In the manual there is a zero-inflated Poisson model. What I need is a zero-inflated double exponential model. The fact that the double exponential is continuous makes this problem a bit messy.

Yeah, it seems a little screwy, but I’m not sure in what way. I’ll look at this tomorrow. Sorry for not giving a clear answer.

It sure seems like the width of the uniform is going to affect the results, which seems wrong.
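As a quick numerical check of this worry (an illustrative sketch, not part of the thread's Stan code): as long as no nonzero observation falls inside the uniform window, shrinking the width only adds a near-constant offset to the log likelihood of the exact zeros, so the maximum-likelihood estimate of theta barely moves. The toy data and the grid search below are my own construction:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import laplace, uniform

sigma = 0.1
# Deterministic toy data: 7 exact zeros plus 93 nonzero values, all of
# which lie well outside even the wider uniform window used below.
y = np.concatenate([np.zeros(7),
                    np.linspace(-0.3, -0.01, 46),
                    np.linspace(0.01, 0.3, 47)])

thetas = np.linspace(0.01, 0.99, 99)

def theta_mle(width):
    """Grid MLE of theta (sigma held fixed) for the uniform-window mixture."""
    lp1 = laplace.logpdf(y, 0.0, sigma)                    # continuous component
    lp2 = uniform.logpdf(y, loc=-width / 2, scale=width)   # narrow "spike" component
    ll = [np.sum(logsumexp(np.stack([np.log(t) + lp1,
                                     np.log1p(-t) + lp2]), axis=0))
          for t in thetas]
    return thetas[int(np.argmax(ll))]

# The window width mostly shifts the zeros' log likelihood by a constant,
# so the theta estimate is essentially unchanged between the two widths.
est_wide = theta_mle(1e-3)
est_narrow = theta_mle(1e-5)
```

The width does matter, however, if some genuinely nonzero observations fall inside the window, which is why the data need a clear gap around zero for this approximation to behave.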

I believe the mixture approach as proposed by @bbbales2 in the 2nd post should work with a small correction. When the observation model is a mixture of a continuous component and a discrete component, the likelihood needs to be defined so that p(y \mid \sigma,\theta) evaluated at y=0 is just Pr(y=0 \mid \sigma, \theta), which in turn equals 1-\theta (conditionally on drawing from the continuous part of the mixture, getting exactly 0 has probability 0). (An attempt at a theoretical justification is at the end of this post.) So, defining the parameters/priors as in @cc20002002’s latest post, we get

```stan
data {
  int<lower=0> N;
  vector[N] y;
}

parameters {
  real<lower=0, upper=0.3> sigma1;
  real<lower=0, upper=1> theta;
}

model {
  for (n in 1:N) {
    if (y[n] == 0) {
      target += log1m(theta);
    } else {
      target += log(theta) + double_exponential_lpdf(y[n] | 0, sigma1);
    }
  }
}
```


(In the real model, you probably want a weakly informative prior rather than hard constraints for \sigma.) I tested this with simulated data with N=10000, \sigma=0.1, \theta=0.93 and it seemed to recover the correct parameters (no other tests done).
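The parameter-recovery claim is easy to sanity-check outside Stan. A minimal NumPy sketch with the same simulation settings: under the corrected likelihood the maximum-likelihood estimates have closed forms (theta-hat is the fraction of nonzero observations, and the MLE of a zero-mean Laplace scale is the mean absolute value), so we can verify identifiability without running MCMC. The seed and the use of MLE rather than the posterior are my own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate from the zero-inflated double exponential with the settings
# from the post: N = 10000, sigma = 0.1, theta = 0.93, where theta is
# the probability of the continuous component (as in the Stan model).
N, sigma, theta = 10_000, 0.1, 0.93
is_cont = rng.random(N) < theta
y = np.where(is_cont, rng.laplace(0.0, sigma, N), 0.0)

# Closed-form MLEs under the corrected likelihood:
theta_hat = np.mean(y != 0)             # fraction of nonzero observations
sigma_hat = np.mean(np.abs(y[y != 0]))  # mean |y| over the nonzero observations
```

With N = 10000 both estimates should land within a couple of standard errors of the true values, matching the recovery reported above.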

### Theory

The probability distribution of y (given parameters) has neither a density w.r.t. the Lebesgue measure (a “continuous pdf”) nor a density w.r.t. the counting measure on the integers (a “discrete pmf”). We define the likelihood as the density with respect to a reference measure that is the sum of the Lebesgue measure and an atom of measure 1 at 0. When the density of interest is written as a mixture of the continuous component and the zero component,

p(y) = \theta\,p_1(y) + (1-\theta)p_2(y),

the density of the continuous component (against our reference measure!), p_1(y), is 0 at y=0 and equals the “plain” double exponential pdf at any y\neq 0. The discrete component p_2(y) is 1 at y=0 and 0 elsewhere.
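In symbols (a restatement of the paragraph above, writing \mu for the reference measure and \delta_0 for the unit atom at zero):

```latex
\mu = \lambda + \delta_0,
\qquad
p(y \mid \theta, \sigma)
  = \theta\,p_1(y) + (1-\theta)\,p_2(y)
  = \begin{cases}
      1-\theta, & y = 0,\\[4pt]
      \dfrac{\theta}{2\sigma}\exp\!\left(-\dfrac{|y|}{\sigma}\right), & y \neq 0,
    \end{cases}

% which integrates to one against \mu:
\int p \,\mathrm{d}\mu
  = \underbrace{(1-\theta)}_{\text{atom at }0}
  + \underbrace{\theta\int_{-\infty}^{\infty}\frac{1}{2\sigma}
      e^{-|y|/\sigma}\,\mathrm{d}y}_{\text{Lebesgue part}}
  = 1.
```

This is exactly the likelihood the corrected Stan model increments: log1m(theta) for the atom at zero, and log(theta) plus the double exponential log density elsewhere.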


Thanks for the correction!

And being thorough!

@JuhoKokkala

Thanks a lot for your help! Could you explain why a weakly informative prior is better in this case?

Yours sincerely,

Chen