Mixed-Density Parameter

A Stan model such as the one below compiles and runs, but it doesn't produce the right density for the single parameter.


transformed data {
  vector[1000] x;
  for (i in 1:1000) {
    x[i] = normal_rng(0, 1);
  }
}
parameters {
  real y;
}
model {
  // each pass through the loop adds another term to the log density of the same scalar y
  for (i in 1:1000) {
    if (x[i] > 0) {
      y ~ normal(10, 2);
    } else {
      y ~ gamma(3, 3);
    }
  }
}

Here is the density plot from Stan sampling:
[Attachment: Rplot.pdf (8.8 KB)]

This is what is expected:
[Attachment: Rplot01.pdf (10.4 KB)]

The problem seems to be that you declare y as a single scalar (real y), so when you run the model for multiple iterations you'll get the sampling distribution of that single value of y. The following code will replicate your undesired figure:

library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
library(tidyverse)

mJeet <- stan_model("path_to_model/mJeet.stan")

fit <- sampling(mJeet,
                refresh = 1e5)

as.data.frame(fit, pars = "y") %>% 
  ggplot(aes(x = y)) + 
  geom_histogram(bins = 80)

[Figure: undesired histogram]
By moving everything to the generated quantities block and correctly specifying the dimensions of y, you can obtain your desired distribution:


data {
  int N; // sample size
}

generated quantities {
  vector[N] x;
  vector[N] y;

  for (i in 1:N) {
    x[i] = normal_rng(0, 1);
    if (x[i] > 0) {
      y[i] = normal_rng(10, 2);
    } else {
      y[i] = gamma_rng(3, 3);
    }
  }
}

mDiscourse <- stan_model("path_to_model/mDiscourse.stan")

fit <- sampling(mDiscourse,
                data = list(N = 1e3),
                iter = 1,
                chains = 1,
                algorithm = "Fixed_param")

as.data.frame(fit, pars = "y") %>%
  gather() %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 80)


The question of being able to estimate a parameter with a mixture density is quite different from generating mixed-density data. Your code is doing the latter.

Sorry for the misunderstanding then… If you want to estimate y, you need to change two things in your model:

  • declare the parameter with its dimensions and constrain it to be greater than 0: real<lower=0> y[N];

  • add the index i to both y's in the for loop in the model block.

data {
  int N;
}

transformed data {
  vector[N] x;
  for (i in 1:N) {
    x[i] = normal_rng(0, 1);
  }
}

parameters {
  real<lower=0> y[N];
}

model {
  for (i in 1:N) {
    if (x[i] > 0) {
      y[i] ~ normal(10, 2);
    } else {
      y[i] ~ gamma(3, 3);
    }
  }
}

After fitting the model and looking at one of the sampled data sets, everything seems to be in order.


dstan <- list(N = 1e3)

mJeet_fix <- stan_model("path_to_model/mJeet_fix.stan") # compile the corrected model above

fitJeet <- sampling(mJeet_fix,
                    data = dstan,
                    chains = 1)

dpred <- as.matrix(fitJeet, pars = "y") %>%
  t() %>% as.data.frame()

colnames(dpred) <- 1:dstan$N

dpred %>% gather(rep, value, 1:dstan$N) %>%
  filter(rep == 1) %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 100)

[Figure: desired histogram]

Not quite there yet. The above histogram isn't the posterior density of a single parameter.

I am not sure about what I am going to write, so take everything I say with a grain of salt.

I think what you are trying to achieve might not be possible in this stylized set-up. Because you have 1000 values for x and only 1 y, you are effectively averaging over the two distributions. You can think of it as: for x[1], y is drawn from the normal; for x[2], y is drawn (I know this is not how HMC/Stan works) from the gamma; and so forth. So the best guess for y is some kind of average over the normal(10, 2) and the gamma(3, 3), which gives you the Rplot.pdf.
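
To make that concrete: every sampling statement inside the loop adds a term to the same log density, so the original model defines a product of densities for the single y, not a mixture:

\log p(y \mid x) = \text{const} + \sum_{i:\, x_i > 0} \log \text{normal}(y \mid 10, 2) + \sum_{i:\, x_i \le 0} \log \text{gamma}(y \mid 3, 3)

With roughly 500 terms from each component, this product concentrates sharply on a compromise value between the two modes, which is what Rplot.pdf shows.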

If you want the nice separation of the distributions, you need the same number of x's and y's, just like @tiagocc wrote. Then for x[1], y[1] is always drawn from the normal; for x[2], y[2] is always drawn from the gamma; and so on. As a result, half of the y's will come from the normal and half from the gamma, and you get the separate distributions in the histograms. The original Stan model basically estimated the mean of the vector y with N elements.

In the spirit of what @stijn said, I am not totally sure this is the correct approach to your problem, but if the goal is to estimate a single parameter, I guess the solution would be to sample it from a finite mixture of the gamma and normal distributions you want, with a mixing proportion lambda:

transformed data {
  real lambda = 0.5; // mixing proportion: P(x > 0) under a standard normal
}

parameters {
  real<lower=0> y;
}

model {
  target += log_mix(lambda,
                    normal_lpdf(y | 10, 2),
                    gamma_lpdf(y | 3, 3));
}

I set lambda to 0.5, since the cdf of the standard normal at 0 is \Phi(0) = 0.5, but you could also generate the samples from the standard normal and compute the ecdf at 0 in the transformed data block.
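
For example, something along these lines would compute lambda empirically in the transformed data block (an untested sketch; the simulated sample size N is my choice, and the parameters and model blocks stay as above):

transformed data {
  int N = 1000;    // number of simulated standard-normal draws (illustrative choice)
  real n_pos = 0;  // count of draws above 0
  real lambda;     // empirical P(x > 0): the weight on the normal component in log_mix
  for (i in 1:N) {
    if (normal_rng(0, 1) > 0) {
      n_pos += 1;
    }
  }
  lambda = n_pos / N;
}

Since the standard normal is symmetric, this will land very close to 0.5 anyway.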

If you plot the posterior samples of y you’ll get something close to what I think you want.

The R code below will replicate the figure:

library(rstan)
options(mc.cores = parallel::detectCores())

mix_jeet <- stan_model("path_to_model/mix_jeet.stan") # compile the mixture model above

fitJeet <- sampling(mix_jeet,
                    chains = 4,
                    iter = 10000,
                    warmup = 2000,
                    control = list(adapt_delta = .99),
                    seed = 132)

dfit <- as.data.frame(fitJeet, pars = "y")

library(ggplot2)
ggplot(dfit, aes(x = y)) +
  geom_histogram(bins = 100)

This is great. The question is finally settled.