Two-dimensional smooth as outcome distribution

Is it possible in brms, or even in general, to have a two-dimensional outcome distribution in a regression model? I think of something like a tensor product smooth as the response.
The application would be, for example, geographical studies of people’s neighbourhood housing preferences (i.e., which neighbourhood you move into might be determined by external factors such as education, income etc.). So an ideal model would take education and income as predictors for a georaphical outcome variable made up of latitude and longitude. This way, it would be possible to determine which factors are predictive of which area in a city without pre-determining any areas.
Is this possible? And if not, is there a workaround that would yield the desired result anyway?

1 Like

In brms one can easily do multivariate gaussian and multivariate student-t response (see set_rescor and Estimating Multivariate Models with brms • brms). There are definitely ways to hack even more crazy 2D distributions, but then you would need to be able to describe what exactly are you trying to achieve…

With that said, I think it would usually be more useful for you to predict for each combination of person/family and location the preference of the person/family for the location (a one dimensional outcome) and then model the process of moving as some sort of weighted random sampling according to this preference.

Best of luck with your model!

Thanks for your suggestions. I quickly came up with a toy dataset to demonstrate what I try to achieve:
In this dataset, I have 400 households with high-income households living in the upper right quadrant of the hypothetical city and highly educated (binary: 0 = no higher education, 1 = higher education) households in the lower left quadrant.

#set up data with 200 random observations, each linked to a latitude and longitude value
df = data.frame(
  'lat' = runif(200, -50,50),
  'long' = runif(200, -50,50),
  'income' = rnorm(200, 0,5),
  'higher_education' = 0

#add 100 observations of high-income households in the upper right quadrant
df2 = data.frame(
  'lat' = rnorm(100, 25,10),
  'long' = rnorm(100, 25,10),
  'income' = rnorm(100, 15,3),
  'higher_education' = 0

#add 100 observations of highly educated households in the lower left quadrant
df3 = data.frame(
  'lat' = rnorm(100, -25,10),
  'long' = rnorm(100, -25,10),
  'income' = rnorm(200, 0,5),
  'higher_education' = 1

df_main = rbind(df,df2,df3)#unify the observations

I am aware of the possibility that you can run a multivariate model like so:

mv_mod <- brm(mvbind(long,lat) ~ income + higher_education, data = df_main)

This, however, would return the predictors’ effects on lat and long independently (but correlated). This is not necessarily an issue but what would be ideal here is to have an actual 2d outcome - something that would give a result similar to

smooth_mod <- brm(mvbind(income, higher_education) ~ t2(long,lat), data = df_main)

Yet here it is difficult to apply the usual model selection routine since (to my knowledge) a potential spurious relationship between income and higher_education would remain undetected. An ideal model would be something like:

ideal_mod <- brm(t2(long,lat) ~ income + higher_education, data = df_main)

Which of course is not possible in brms.
Therefore I am curious whether there is a workaround to yield such a result.

So I am not sure to what extent is your example representative of your actual scientific problem, but if it is, I think modelling the place where people live as a smooth 2D distribution (of almost any kind) seems (without having deep knowledge of the problem) pretty weird. Where I live, the boundaries between nice and not-so-nice neighbourhoods are often pretty sharp and don’t form any sort of nice blobs - downtown you have (somewhat) cheap housing near a large street (because of the noise), but just around the corner you have some of the most expensive housing. Streets/rivers can create sharp boundaries between good/bad neighbourhoods (by whatever metric).

With that said, splines could probably actually define a distribution - you would need to construct a class of 2D splines that are strictly positive and integrate to 1. I am quite sure there are 1D such splines (used for monotonic effects), so I don’t see any reason why a 2D generalization shouldn’t exist (maybe it already does). Such a distribution could be used with brms with a bit of abuse of the custom family feature - there are a few tricks to build custom multivariate distributions. However, I’ve never seen anything like this done - which might be my fault, but it also might be because it doesn’t work that well in practice.

But I think the alternative of modelling some sort of preference instead of a custom 2D distribution makes a bit more sense. In fact, it IMHO could be understood to be somewhat analogous: when fitting a 2D distribution, we are constructing the conditional distribution \pi(y_1, y_2 | X, \theta) where y_1,y_2 are the coordinates, X is the set of all predictors and \theta a set of all model parameters. This could however be understood as a function from predictors, parameters AND output coordinates to positive numbers y_1, y_2, X, \theta \rightarrow f(y_1, y_2, X, \theta) = \pi(y_1, y_2 | X, \theta) \in \mathbb{R}^+ (satisfying a few other constraints). So you can build a univariate model that tries to approximate f - the density of people living given location and other predictors. The analogy is imperfect, as you need to somehow model the response which will break the strict correspondence, but I just wanted to show that the (IMHO) easier method of doing a univariate regression actually achieves something quite similar to a flexible 2D distribution. Anyway, this paragraph is a wild thought that ocurred to me while biking, so not 100% sure if it is completely sound or whether there is some broader literature discussing such issues.

Best of luck with the model!