# Likelihood for subsets of data

I have an interesting problem that I’m trying to determine understand how to model in Stan. The problem I’m having involves how to select and marginalize over different subsets of data in a probabilistic way.

I have a problem with two measured noisy variables y and z which have an associated order (y_0 to y_n). Over a particular range within the ordered data y_a to y_b, the two variables have a clear linear relationship (i.e. \frac{y-y_a}{z-z_a}=k). Outside of this range the data can take effectively any form. Here’s an image of a toy example of this kind of data, with the linear region highlighted in red.

The range of data for which the relationship holds is not apparent. It could be that none of the data follow a linear relationship (the case that a=b), or that all of them do (a=0, b=n). It seems that for a given k, the likelihood would be the sum of the likelihood of k for all possible combinations of a and b.

P(y,z|k)=\sum_{a=0}^n\sum_{b=a}^nP(y[a:b],z[a:b]|k)

Because n is not very large (~order 20), this seems like it could work and might be programmable in Stan as we can just treat each of the subsets as a separate set of data points. The problem is that selections with small numbers of points will have higher likelihoods than those with large numbers of points, whereas we expect longer linear runs to contain more information and would like to weight those higher.

Are there any good examples in the literature of parameterizing subsets of data like this or does anyone have any suggestions?

You could probably try something like here: https://mc-stan.org/docs/2_21/stan-users-guide/change-point-section.html, but you’d have to have an alternate model for the two sets of points at the end.

But looking at this plot it seems like things are gonna be sketchy whichever way you get it coded up.

Thanks for the suggestion,

I won’t go too far into the background here and will describe the problem in statistical terms, but I’ll put some links at the end for papers to read on the problem.

The underlying generative process for this type of data is rather complicated which is why they are typically presented in this way. Essentially our z data are generated under known physical conditions and are used to normalize our being used to normalize our y data which are generated under unknown physical conditions. The points at the start of the data vector generally represent a change in the set of physical conditions that the y data are experiencing.

This is probably possible to model, although not trivial. The points at the end of the will often come from an irreversible change in physical properties of the material being studied which is definitely not tractable as a model. The experiment effectively ‘failed’ at this point but the data prior to that are normally usable. There can also be an issue where the material itself violates the assumptions of the experiment leading to the points being non-independent, which will generally give a plot that is not linear anywhere. As such, the general approach is to search for a linear part of the data if one exists.

Something that has shown some promise in the past is an approach that bootstraps different subsets of points, see https://doi.org/10.1016/j.epsl.2011.08.024 for a detailed discussion. It would be nice to apply something similar to this in a probabilistic framework so that errors from individual sets of points can be propagated to a higher level in a hierarchical model. See also: https://doi.org/10.1029/2012GC004046 for a probabilistic forward model for an ideal experiment.

Well as much as you can model the head and the tail, try to figure that out. I think that’s pretty related to your intuition:

But I don’t think that this is true:

For instance, the points aren’t distributed uniformly [0, 1]x[0, 1].

I solved a detection problem recently, that involved comparing regions with different numbers of points. IMO, the thing to do is set up a hierarchical model. Each subset is fitted to generate some parameters for the same sub model, and then your comparison or reasoning can happen at the level of fitted parameters rather than the data. Kind of like the 8 school problem.