Feature request: copulas for multivariate responses of mixed types


#1

Paul, I see that you list general multivariate models as an area for future development in brms. In randomized clinical trials there is a significant need for multivariate modeling of mixtures of univariate outcomes, including binary, ordinal, continuous, and time-to-event outcomes. Multivariate copulas seem to be the best way to go because researchers wish to get the usual marginal interpretation of the treatment effect for each of the component outcomes. This paper by Costa and Drury is excellent. It deals with a bivariate situation: a continuous outcome and a binary outcome joined with a copula. They even go so far as to allow the copula dependence parameter to vary by treatment, since in the placebo group the two responses can be decoupled (that would be a feature for the more distant future). I hope you’ll consider implementing copulas, at least ones whose dependence parameters have specified priors but do not vary with treatment. Thanks for considering!
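
To make the "marginal interpretation" point concrete, here is a minimal sketch of a joint log density for one continuous and one binary outcome tied together by a Gaussian copula. This is not the Costa and Drury model; the normal margin for the continuous outcome, the Gaussian copula, and all names (`joint_logpdf`, `mu`, `sigma`, `p`, `rho`) are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for Phi and its inverse


def joint_logpdf(y, b, mu, sigma, p, rho):
    """Log density of (y, b) under an assumed Gaussian copula.

    y   : continuous outcome, assumed Normal(mu, sigma) marginally
    b   : binary outcome (0 or 1), with marginal P(b = 1) = p
    rho : copula correlation in (-1, 1); rho = 0 gives independence
    """
    z = (y - mu) / sigma
    # Marginal log density of the continuous outcome.
    log_f_y = -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    # Conditional probability of b = 1 given y, from the latent normal pair.
    a = STD_NORMAL.inv_cdf(p)
    p1 = STD_NORMAL.cdf((a + rho * z) / math.sqrt(1.0 - rho * rho))
    prob = p1 if b == 1 else 1.0 - p1
    # Guard against log(0) far out in the tails.
    return log_f_y + math.log(max(prob, 1e-300))


if __name__ == "__main__":
    print(joint_logpdf(1.5, 1, mu=1.0, sigma=2.0, p=0.3, rho=0.6))
```

Integrating out `y` returns `p` for the binary outcome whatever the value of `rho`, which is exactly why the treatment effect on each margin keeps its usual interpretation.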


#2

Hi Frank,

I agree this will be an important extension for brms. Does anyone have any experience with copulas in Stan in general? My own knowledge in this area is still rather limited. Also, if we end up agreeing this is doable, we should open an issue on GitHub (https://github.com/paul-buerkner/brms), where I keep track of all the feature requests.


#3

Great. Sorry I didn’t think of opening an issue in Github. I hope some people with Stan experience with copulas respond. The paper I referenced does not provide any code and I’m not sure which software system they used. I’ll contact them. I’ll add this request to Github.


#4

There is an excellent paper about using copulas on count data:

A PRIMER ON COPULAS FOR COUNT DATA
BY CHRISTIAN GENEST AND JOHANNA NESLEHOVA

An advantage of copulas is that they can model tail dependence; a multivariate
Gaussian only captures linear dependence between random variables.
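
As a rough illustration of what tail dependence means, the sketch below evaluates P(V <= q | U <= q) = C(q, q) / q for small q. The Clayton copula is used as an example with lower tail dependence, and the independence copula as a reference with none. The Gaussian copula's ratio also goes to zero for any correlation below 1, but its CDF has no closed form, so it is left out here. The function names and the value `theta = 2` are assumptions for illustration.

```python
def clayton_cdf(u, v, theta):
    """Clayton copula CDF for theta > 0."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def independence_cdf(u, v):
    """Independence copula: C(u, v) = u * v."""
    return u * v


def lower_tail_ratio(cdf, q):
    """P(V <= q | U <= q) = C(q, q) / q; the limit as q -> 0 is the
    lower tail dependence coefficient."""
    return cdf(q, q) / q


if __name__ == "__main__":
    theta = 2.0
    for q in (1e-1, 1e-3, 1e-6):
        clayton = lower_tail_ratio(lambda u, v: clayton_cdf(u, v, theta), q)
        indep = lower_tail_ratio(independence_cdf, q)
        print(q, clayton, indep)
```

For the Clayton copula the ratio settles at 2^(-1/theta) instead of vanishing, so joint extreme low values stay likely however far into the tail one looks.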

Some work has been done in Stan by Ben Lampert. It is based on the Poisson
distribution, but can easily be extended to other discrete distributions.

Copulas are one way to go, and a very flexible one. This comes at a price, though: they are
technically demanding, but solvable. There is also the need for more data or stronger priors.
One has to consider whether multivariate extensions, e.g. Laguerre polynomials, may already be
suitable. Then it comes to the point where we have to analyse your data,
happily, provided we have not already been replaced by some sophisticated deep learning
whatsoever.
Your question is too vague to give an answer. (By no means do I want to offend anybody.)
Just my 2 cents.


#5

@bgoodri has been talking about adding them to Stan. He is behind most of our multivariate stats.

The usual obstacles to new features are a clear design (what does this look like to you functionally when done) and someone to do it.


#6

More the latter. They are basically just density functions for a multivariate random variable with uniform margins, so it is not as if there is much discretion in the design.
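
For reference, the link between such a uniform-margin density and an arbitrary multivariate density comes from Sklar's theorem. Assuming continuous margins with CDFs $F_i$ and densities $f_i$, and writing $C$ for the copula CDF and $c$ for its density (notation introduced here, not taken from the thread):

$$
f(x_1, \dots, x_d) = c\bigl(F_1(x_1), \dots, F_d(x_d)\bigr) \prod_{i=1}^{d} f_i(x_i),
\qquad
c(u_1, \dots, u_d) = \frac{\partial^d C(u_1, \dots, u_d)}{\partial u_1 \cdots \partial u_d}
$$

So on the log scale the copula contributes one extra additive term on top of the usual marginal log densities.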


#7

I wasn’t talking about anything fancy here, just:

  • function signature,
  • mathematical definition, and
  • naming conventions.

Is there a math lib issue somewhere?

More importantly, is it something you (Ben) think we should do?


#8

I think it should be done, but there is no issue.

The signature would be a bit different from what we currently have in Stan Math because the random variable is bivariate or multivariate. I suppose we could require a vector or array of vectors that is exactly of length two, but it would look better in the Stan language if it were something like target += clayton_copula_lpdf(foo, bar | tau);.

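The generic definition is

$$C(u_1, \dots, u_d) = \Pr(U_1 \le u_1, \dots, U_d \le u_d), \qquad U_i \sim \mathrm{Uniform}(0, 1),$$
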
but there are dozens of specific ones within that definition that each have their own density functions. The names are reasonably straightforward because they tend to be named after people, but there are various parameterizations. A sane library would try to parameterize as many of them as possible in terms of Kendall’s $\tau$ or something.
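
As a sketch of what a Kendall's tau parameterization could look like for one specific family, here is a bivariate Clayton copula log density in Python. It relies on the standard Clayton relation theta = 2 * tau / (1 - tau). The function names mirror the hypothetical `clayton_copula_lpdf` above and are not an existing Stan Math API.

```python
import math


def clayton_theta_from_tau(tau):
    """Clayton dependence parameter from Kendall's tau, for 0 < tau < 1."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must be in (0, 1) for the Clayton copula")
    return 2.0 * tau / (1.0 - tau)


def clayton_copula_cdf(u, v, tau):
    """Clayton copula CDF C(u, v), with u and v in (0, 1)."""
    theta = clayton_theta_from_tau(tau)
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def clayton_copula_lpdf(u, v, tau):
    """Log of the Clayton copula density, the mixed partial d2C/(du dv)."""
    theta = clayton_theta_from_tau(tau)
    s = u ** -theta + v ** -theta - 1.0
    return (math.log1p(theta)
            - (theta + 1.0) * (math.log(u) + math.log(v))
            - (2.0 + 1.0 / theta) * math.log(s))


if __name__ == "__main__":
    print(clayton_copula_lpdf(0.3, 0.7, tau=0.5))
```

Parameterizing by tau keeps the prior on a scale that is comparable across families, at the cost of restricting this particular family to positive dependence.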


#9

That’s definitely beyond what the language will support now. How would you feel if that had to be a tuple, as in:

clayton_copula_lpdf( (foo, bar) | tau);

or

(foo, bar) ~ clayton_copula(tau);

For now, I’d be OK with one that took an array or vector of size two.

Is the size-two constraint there because those are the only CDFs we’ll be able to implement?


#10

I think a tuple of size two or an array of tuples is fine. Most known copulas are bivariate, but some are multivariate (Gaussian is the most common multivariate one). Also, some bivariate ones have multivariate extensions, which will put some stress on our naming conventions.


#11

This is what I meant by a naming convention design. I don’t mind just putting multi_ in front of something or overloading. If you could write that up as a math lib issue with a single example to do first, we might be able to find someone to code it.


#12

I’m very glad to see this discussed. I can’t emphasize enough how many applications would benefit from copulas. It is the norm in randomized clinical trials, for example, to analyze all patient endpoints separately with no borrowing of information, not to mention using ad hoc frequentist type I error control based on independence of endpoints. In clinical trials costing tens of millions of $ we don’t even learn how the various patient outcomes ‘run together’ nor do we profit from the correlations in terms of frequentist (or Bayesian) power. At FDA I’m pushing the utility of computing P(drug benefits outcome 1 and/or drug benefits outcome 2) and especially for the “and” the dependence modeling is crucial.
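
A minimal sketch of how those "and/or" probabilities would be read off joint posterior draws, and why the dependence matters for the "and". The draws below are simulated stand-ins, not output from any real trial or model, and the correlation `rho = 0.7` and effect sizes are assumptions for illustration.

```python
import math
import random

random.seed(1)

# Hypothetical joint posterior draws of two treatment effects
# (positive = benefit), simulated here with correlation rho.
rho = 0.7
draws = []
for _ in range(20000):
    z1 = random.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * random.gauss(0.0, 1.0)
    draws.append((0.5 + z1, 0.5 + z2))

n = len(draws)
p1 = sum(e1 > 0 for e1, _ in draws) / n                    # P(benefit on outcome 1)
p2 = sum(e2 > 0 for _, e2 in draws) / n                    # P(benefit on outcome 2)
p_and = sum(e1 > 0 and e2 > 0 for e1, e2 in draws) / n     # P(benefit on both)
p_or = sum(e1 > 0 or e2 > 0 for e1, e2 in draws) / n       # P(benefit on at least one)

if __name__ == "__main__":
    print("P(1) =", p1, " P(2) =", p2)
    print("P(1 and 2) =", p_and, " vs independence:", p1 * p2)
    print("P(1 or 2)  =", p_or)
```

With positively correlated effects, multiplying the two marginal probabilities understates the "and" probability, which is the piece that separate endpoint analyses cannot supply.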


#13

Looks like we’re about to get a bunch of copula functions ported to Stan from the vinecopulib package. The first target copulas are

Gaussian, Student-t, Clayton, Gumbel, Frank, BB1, BB6, BB7, BB8

if that means anything to you. There’s no strict timeframe, as we rely on volunteer contributions.


#14

This is great news Bob. Sorry for the delayed response. I hope you get volunteers. I am not familiar with most of those copulas. I think what is important is a general framework that allows combinations of dependent variables of mixed categorical / ordinal / continuous types.