Zero-inflated independent variable

Hi,

I’m trying to find the right way to estimate the relationship between a dependent and an independent variable, but I’m not sure what is happening here. The dependent variable is a specific sound made by a type of bird; its values represent how many times a bird makes that sound in a minute. The independent variable is how much a bird sings with birds of other types. A value of zero indicates that a bird never sings with other types of birds. Higher values represent increased entropy (the bird sings with a larger number of other bird types). Almost half of the birds only sing with their own kind, so the independent variable is ‘zero-inflated.’

See below for three brms suggestions. The first is the most intuitive to me. The dependent variable looks normally distributed to me, and I assume that the distribution of the independent variable is not a big concern (?). In the second model (model_2), the dependent and the independent variables are swapped in the formula, but the standardized output looks identical to that of model_1. In the third model (model_3), ‘family = hurdle_lognormal()’ is added, and the output is more conservative than in model_1 and model_2. The tail in pp_check(model_3) is longer than in the density plot, so hurdle_lognormal() may not be entirely appropriate.

Which model shows the correlation between the dependent and the independent variables, if any? And why do model_1 and model_2 produce the same standardized estimates (with effectsize::standardize()) if the distribution family matters in brms? (Edit: after jsocolar’s response, I added that it is the standardized output, not the unstandardized, that is similar.)

model_1: [image]

model_2: [image]

model_3: [image]

Density plot of independent variable: [image]

```r
library(tidyverse)  # tribble(), %>%, ggplot()
library(brms)       # get_prior(), brm(), pp_check()

data <- tribble(
  ~independent, ~dependent,
  0.475104950825432,24.33385976553,
  0.931752350036424,13.4535568661822,
  0,35.5570188879926,
  0.978876249602107,19.0955559671331,
  0,22.4963311249353,
  1.48092077315076,17.8123502749233,
  0,19.3291471951513,
  0,23.2441762977182,
  0,23.2927636346998,
  0,24.3954700801138,
  0,20.5882384242946,
  0.864789009883719,22.3505775686187,
  0.47019083324118,19.5663614624022,
  1.14237424676109,16.5588026482227,
  1.17147105221271,12.1286866240021,
  0,16.3087052536411,
  0,25.0383445586907,
  0,27.0993725140426,
  0.910750569086581,19.4271473202162,
  0,21.539334714812,
  0,16.1585341270091,
  0,14.8117535496131,
  0,23.7749241592805,
  0.729393530556281,15.7854703429959,
  0,19.6348331819269,
  0,18.7496761710538,
  0.469795349397395,19.9684431411922,
  0,17.5952606880194,
  0.926879666280559,18.014171155283,
  0,26.6404232174919,
  0,34.6939701259565,
  0.468784998340705,23.4421951670494,
  0.873477067021566,14.8475658012473,
  0,24.8547255261819,
  0.925798574658323,19.1606375746935,
  1.3758660136893,14.2614635077128,
  0.907496382286009,5.20771720976394,
  0,11.7335616128042,
  0,11.0442051544141,
  0.585840348633798,22.3243545533164,
  0.476637505556964,23.8691125356285,
  0,23.1696709382993,
  0,21.3636514620346,
  0,32.1265174635231,
  0.925929124062113,27.3552238941316,
  0,26.9585939107343,
  0,20.6204570296389,
  0,12.704953586671,
  0,25.0766825202277,
  0.464895284954346,18.5166531201006,
  0,15.1937439281183,
  0.591470236948633,23.3379526159775,
  0,27.0617666242848,
  0,19.3919112587532,
  0,29.6596276319073,
  0,34.3326412449171,
  0,24.9433341755981,
  0.727810081716114,15.3767806419392,
  0.471772729519587,24.5938346005064,
  0,18.6690857279866,
  0.81491944165422,20.7457489790098,
  0,24.6863561432963,
  0,27.26502689043,
  0,23.153980266219,
  0,22.1294972993748,
  0.47630755819978,25.1068688975237,
  0,18.0092791211805,
  1.37170237143627,13.8415553147245,
  0,22.1942321244425,
  0,27.667486169116,
  0,33.0707348516833,
  0.469377824759326,23.4354194391814,
  0,35.1490587595687,
  0,22.4715336148491,
  0,17.0661503945461,
  0,22.2264237949063,
  0,13.7640418567295,
  0,24.4250051967047,
  0,14.9715091586532,
  0.72951998261547,14.9108687758242,
  0,11.674792236118,
  0,21.0497395321024,
  0,26.3728197700917,
  0.475776553536671,17.6718281712596,
  0,16.1805340750521,
  0,25.5407940439696,
  0.586691374392858,15.07947223304,
  0,23.6832147587178,
  0,11.8085277685732,
  0.461258342759425,14.8125126693847,
  0,19.4737144352526,
  0,16.0046843086061,
  0.91764819052779,10.8793686697501,
  0.921983886262004,7.28807984606729,
  1.00998273481615,9.25967267187909,
  0,20.1824607743782,
  0,30.8862120518263,
  0,20.6327939396891,
  1.29706386830051,13.3517691018069,
  0,21.9623659882357
)

# model_1: dependent regressed on independent (default Gaussian family)
prior <- get_prior(data = data,
                   dependent ~ independent)
prior
# overwrite the first row of the prior table
prior$prior[1] <- "normal(0,5)"
model_1 <- brm(data = data,
               dependent ~ independent,
               prior = prior)

# model_2: variables swapped (default Gaussian family)
prior <- get_prior(data = data,
                   independent ~ dependent)
prior
prior$prior[1] <- "normal(0,5)"
model_2 <- brm(data = data,
               independent ~ dependent,
               prior = prior)

# model_3: variables swapped, hurdle lognormal family for the zero-heavy outcome
prior <- get_prior(data = data,
                   independent ~ dependent,
                   family = hurdle_lognormal())
prior
prior$prior[1] <- "normal(0,5)"
model_3 <- brm(data = data,
               independent ~ dependent,
               family = hurdle_lognormal(),
               prior = prior)

pp_check(model_3)

# density plot of the independent variable
data %>%
  ggplot(aes(independent)) +
  geom_density()
```

* Operating System: macOS Monterey
* brms Version: 2.16.3

Hi, welcome to Discourse!
By definition, the independent variable always goes on the right-hand side of a brms model formula, and the dependent variable always goes on the left. The choice of which variable is independent and which is dependent is not always straightforward, since in practice sometimes we are interested in the overall relationship between two quantities without much a priori idea of the direction of causality or of which one’s variation we want to understand as an “outcome”. Taking the language of independent and dependent variables from your post at face value (i.e. you have some a priori grounds for calling one variable independent and the other dependent), only the first of your three models makes any sense. I am certain that the output for model_2 is not identical to the output for model_1 (not even close), though it’s possible that the coefficient estimate happens to be similar purely as a fluke of the data. We can see that the overall output is quite different by comparing, for example, the posterior predictive plots associated with the two models.
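A rough way to see that difference without refitting the brms models is sketched here. It is only a stand-in under stated assumptions: it uses `lm()` and `simulate()` from base R in place of brms, and the data (`x`, `y`, and the parameters that generate them) are hypothetical, chosen to mimic a predictor that is exactly zero for about half of the observations.

```r
# Hypothetical data (not the data from the post):
# x is zero for roughly half the observations, y is roughly normal.
set.seed(1)
n <- 200
x <- ifelse(runif(n) < 0.5, 0, runif(n, 0.4, 1.5))
y <- rnorm(n, mean = 23 - 6 * x, sd = 5)

fit_1 <- lm(y ~ x)  # least-squares analogue of model_1
fit_2 <- lm(x ~ y)  # least-squares analogue of model_2 (variables swapped)

# Draw 50 replicated outcome vectors from each fitted Gaussian model.
rep_y <- as.matrix(simulate(fit_1, nsim = 50, seed = 2))
rep_x <- as.matrix(simulate(fit_2, nsim = 50, seed = 2))

# Observed x has a spike at exactly zero and no negative values;
# Gaussian replicates of x reproduce neither feature.
obs_zero_share     <- mean(x == 0)
rep_zero_share     <- mean(rep_x == 0)
rep_negative_share <- mean(rep_x < 0)

print(c(observed_zero       = obs_zero_share,
        replicated_zero     = rep_zero_share,
        replicated_negative = rep_negative_share))
```

With the fitted brms objects, the graphical version of this comparison is what `pp_check(model_1)` and `pp_check(model_2)` display.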

You are right that you don’t need to make any special correction for the distribution of your independent variable, provided that the independent variable is exogenous and measured with certainty.

Your posterior predictive plot for model 1 looks pretty good, so there’s no obvious red flag here. However, your dependent variable is a count variable (it is discrete and it cannot be less than zero), and it is probable that you could obtain better inference by using a distribution that is appropriate for counts (e.g. Poisson) in your model.

Thanks for the detailed response! You are correct that the output for model_1 and model_2 differs. Sorry, I should’ve mentioned that the output becomes almost identical when using effectsize::standardize(). model_2 doesn’t look good according to pp_check(model_2), so I didn’t expect the similarities between model_1 and model_2 with standardized estimates.

effectsize::standardize(model_1):
dependent -0.43 0.09 -0.61 -0.25 1.00 3545 2825

effectsize::standardize(model_2):
independent -0.43 0.09 -0.61 -0.25 1.00 3521 2394
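
One likely reason for the matching numbers, offered as a sketch and not as a statement about what `effectsize::standardize()` does internally: in a regression with a single predictor, once both variables are z-scored the least-squares slope equals the Pearson correlation, which is symmetric in the two variables. The example checks this with base R `lm()` on hypothetical simulated data (`x`, `y`, and the helper `z()` are illustrative names, not from the post). A Gaussian brms model with a weak prior would be expected to give a posterior mean close to the same value, though that is an assumption here and not something the sketch verifies.

```r
# Hypothetical data (not the data from the post).
set.seed(42)
n <- 100
x <- ifelse(runif(n) < 0.5, 0, runif(n, 0.4, 1.5))
y <- rnorm(n, mean = 23 - 6 * x, sd = 5)

# z-score helper: center and scale to unit standard deviation.
z <- function(v) (v - mean(v)) / sd(v)

# Standardized slope in each direction.
slope_y_on_x <- unname(coef(lm(z(y) ~ z(x)))[2])
slope_x_on_y <- unname(coef(lm(z(x) ~ z(y)))[2])

# Pearson correlation of the raw variables.
r <- cor(x, y)

print(c(y_on_x = slope_y_on_x, x_on_y = slope_x_on_y, pearson_r = r))
```

The unstandardized slopes still differ between the two directions, because they equal r * sd(y) / sd(x) and r * sd(x) / sd(y) respectively.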