How to model a choice between 2 distributions

Hello! This is my first time posting on this forum so please go easy on me :).

I’m trying to understand the right way to describe a one-hot kind of model.

Each output datapoint is a vector of independent measurements. Every non-zero feature of every datapoint is either signal or noise, and there can be only one signal feature; the rest are noise.

I am assuming that the signal comes from a triangular distribution and the noise comes from a gamma distribution.

I’m trying to figure out how to specify a model for this (one latent feature comes from one distribution and all others come from the other distribution). Would this be something like a logistic representation?

The next step will be to turn this into a mixture model. I’ll be at StanCon this year and I want to get a ways into constructing my model so I can get more out of the tutorials.

[example image]

example_data.csv (123.0 KB)


The traditional way of doing this is to marginalize out the discrete choice. See the Stan manual section on “Change Point Models”.
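In sketch form, the marginalization for a single datapoint y (a vector of D features) looks something like this. Here signal_lpdf and noise_lpdf stand in for whatever densities you pick (your triangular and gamma), phi_s and phi_n for their parameters, and theta is a simplex of prior probabilities that each feature is the signal; all of these names are illustrative:

vector[D] lp;
for (d in 1:D) {
  lp[d] = log(theta[d]);  // log prior probability that feature d is the signal
  for (e in 1:D)  // feature d gets the signal density, the rest get noise
    lp[d] += (e == d) ? signal_lpdf(y[e] | phi_s) : noise_lpdf(y[e] | phi_n);
}
target += log_sum_exp(lp);  // marginalizes out the discrete choice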

I have coincidentally just written up an alternative method that produces an approximate one-hot encoding using the Rebar distribution. See Finally A Way to Model Discrete Parameters in Stan and https://github.com/howardnewyork/rebar/blob/master/README.md

I do not think Stan has a triangular distribution available, so that would have to be manually coded.
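For what it’s worth, a hand-coded triangular log density might look like this (a sketch, with lower bound a, upper bound b, and mode c; check the boundary behavior for your use case):

functions {
  // triangular log density on [a, b] with mode c
  real triangular_lpdf(real x, real a, real b, real c) {
    if (x < a || x > b)
      return negative_infinity();  // zero density outside the support
    if (x < c)
      return log(2) + log(x - a) - log(b - a) - log(c - a);
    return log(2) + log(b - x) - log(b - a) - log(b - c);
  }
}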

If you define X to be a 70-element vector having a Rebar distribution (see the GitHub reference), where the d’th element is 1 for the signal feature and all other elements are zero for the noise, then you can write the likelihood statement as:

// inside an enclosing loop over datapoints i
for (d in 1:D) { // loop within vector, D = dimension of the vector
  if (y[i, d] != 0) // exclude zero-valued data
    target += (1 - X[d]) * normal_lpdf(y[i, d] | mu[1], sigma[1])
            + X[d] * normal_lpdf(y[i, d] | mu[2], sigma[2]);
}

The conditional on y[i, d] excludes the zero-valued data.

You can adjust the code to use two different families for noise and signal rather than just the normal for both, but be careful to provide some structure, e.g. by constraining the mean of the noise distribution relative to the signal’s, or by using informative priors, so as to avoid label switching. When I ran my code it was still somewhat susceptible to label switching, so running a single chain rather than multiple chains is safer. I am not quite sure how to avoid this completely.
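One standard Stan trick for breaking that symmetry is an ordered constraint on the location parameters, so the two components cannot swap labels. For the normal example above, that would be:

parameters {
  ordered[2] mu;           // enforces mu[1] < mu[2], so the labels cannot swap
  vector<lower=0>[2] sigma;
}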

I also included optional code to use the standard marginalization approach. That approach gave unsatisfactory results, so either it just does not work very well here or there is an error in my implementation of the marginalization option.

Hope this helps.

noise_and_signal.R (1.6 KB)
noise_and_signal_2.stan (2.0 KB)


No. Running a single chain is never safer than running multiple chains, and running a single chain doesn’t solve the label-switching problem. You can use Stan with multiple chains even in the case of label switching, but then you will likely need to make your own convergence diagnostics that take the label switching into account.

Please let us know when you figure out which one is the reason. It would be helpful for gathering evidence on the usefulness of Rebar.


Chris:
Your problem is very similar to the toy problem in the paper “Sparsity information and regularization in the horseshoe and other shrinkage priors” by Juho Piironen and Aki Vehtari. The toy problem allows for multiple signal measures, whereas your problem allows for only a single signal.

I have written up the Rebar approach to solving this problem; it is described here: Finally A Way to Model Discrete Parameters in Stan

You can try both approaches, Rebar and Horseshoe. I would be interested to find out which performs better on your actual data.

Thanks @howard. I’m working on adapting it now. I’ll let you know how it goes after I expand my example to multiple clusters.

We try to be nice to everybody!

Why? Is there something naturally generating a triangular distribution? We generally recommend softer priors (without hard interval cutoffs).

As others have said, you need to marginalize out the decision about which is the hot parameter. This is pretty straightforward. There’s a chapter in the manual on mixture models, and this is similar.
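For reference, the basic two-component mixture from the manual marginalizes the component indicator like this (lambda is the mixing proportion); the one-hot version follows the same pattern, just with a log_sum_exp over which feature is hot:

for (n in 1:N)
  target += log_mix(lambda,
                    normal_lpdf(y[n] | mu[1], sigma[1]),
                    normal_lpdf(y[n] | mu[2], sigma[2]));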

Thanks Bob! Looking forward to meeting you at the conferences this week :).

Yes, I’ve moved over to the marginalized distributions based on the examples for the mixture models, and they are treating me well so far. I’m still having issues with the ragged simplexes, but that’s a different thread.

I’m assuming a triangular distribution because the probability of my variables increases linearly with the independent variable. For example, take the number of potholes on a given street: if a street is twice as long, it most likely has twice as many potholes. Do you think there is a better way to model this phenomenon? I’ve been trying to figure it out, but so far a triangular distribution seems to be the best bet.

Perhaps an exposure term in a Poisson? This is common with hierarchical count data: you model something like the rate per capita hierarchically, then you model each region using a Poisson based on population, y[i] ~ poisson(population[i] * rate[i]), where population[i] is data.


If you wouldn’t mind @Bob_Carpenter, could you expand on this exposure term a little more? I’m really struggling with imagining this for continuous variables.

Let’s say my data is 5 streets:

Street   Length (miles)
1           1.0
2           1.0
3           1.0
4           1.0
5           4.0

If somebody tells me that there is a single pothole somewhere, how can I capture the fact that there is a 50% chance that the pothole is on Street #5?

Often you model rates, like the number of potholes per mile of road, say rate. Then for each road i with length length[i], you’d model the number of potholes on road i as y[i] ~ poisson(rate * length[i]). With random effects, you might let the rate vary by road, rate[i], and give the rates a hierarchical prior, rate[i] ~ foo(...). Having a separate exposure term, here length[i], which is data, lets the rate parameters all be on the same scale.
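Putting that together for the street data above, a minimal sketch (street_length plays the role of length[i]; all names are illustrative):

data {
  int<lower=1> N;                     // number of streets
  vector<lower=0>[N] street_length;   // exposure: length in miles
  array[N] int<lower=0> y;            // observed pothole counts
}
parameters {
  real<lower=0> rate;                 // potholes per mile
}
model {
  rate ~ normal(0, 5);                // weakly informative prior (an assumption)
  y ~ poisson(rate * street_length);  // exposure scales the Poisson mean
}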

Are you literally trying to do this inference?

Pr[pothole is on street 5 | pothole is on 1, 2, 3, 4, or 5]

Then it depends on your model. But if it’s just a simple Poisson with exposure, we happen to know from the additivity of the Poisson that if u ~ Poisson(a) and v ~ Poisson(b) are independent, then u + v ~ Poisson(a + b). This means that conditional on the total, each count is allocated in proportion to its rate parameter, so you can derive this result analytically. You don’t need sampling.
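For the street data above: the count on Street #5 is Poisson(rate * 4.0) and the total count is Poisson(rate * 8.0), so conditional on exactly one pothole overall, Pr[it is on Street #5] = (rate * 4.0) / (rate * 8.0) = 0.5, the 50% you expected, regardless of the value of rate.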

In general, to do Bayesian posterior predictive inference, you’d set up the five variables y[1], ..., y[5] for the number of potholes on each road, generate them in generated quantities, and then take the proportion of draws where the event holds. That works for arbitrary distributions.
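With the sketch model above, that would look something like this (names illustrative):

generated quantities {
  array[5] int y_sim;
  int single_pothole_on_5;
  for (i in 1:5)
    y_sim[i] = poisson_rng(rate * street_length[i]);
  // keep the draws with exactly one simulated pothole; the average of this
  // flag over those draws estimates Pr[pothole is on street 5 | one pothole]
  single_pothole_on_5 = (sum(y_sim) == 1) && (y_sim[5] == 1);
}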
