Quadratic distance in regression

I was reading a Gladwell book a few years back, and he noted that in education, class size was believed to be a big predictor of student success. The early idea was that if a class was too big, some students would simply get ignored and miss out on attention, participation, and other forms of active learning.

Taken to its limit, this idea implies one teacher per student. But as that limit was approached, the marginal benefit of one fewer student reversed. The explanation he offered is that fellow students can ask insightful questions: you learn a lot from how your peers interact with new information.

So too many students means attention deprivation, and too few removes the peer effect. It’s a “happy medium” problem where we don’t know what the happy medium is. The impression I walked away with is that this sort of pattern is actually quite common.

I wondered if this could be articulated in a model where \alpha is the latent optimal value.

y = \beta_1 (x-\alpha)^2 + \beta_0

Sure. To complete the model, you need to add an error term, like
\mu = \beta_1(x - \alpha)^2 + \beta_0
y \sim \textrm{Normal}(\mu, \sigma)

Here, \beta_0, \beta_1, \alpha, and \sigma are parameters to be estimated.

If you’d like, you can expand the first line to look more like a traditional regression formula:
\mu = c_0 + c_1x + c_2x^2
with
c_0 = \beta_0 + \beta_1\alpha^2
c_1 = -2\beta_1\alpha
c_2 = \beta_1
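
These follow from expanding the square:
\beta_1(x - \alpha)^2 + \beta_0 = \beta_1 x^2 - 2\beta_1\alpha x + (\beta_0 + \beta_1\alpha^2)
and inverting the map recovers the vertex-form parameters from the standard ones:
\alpha = -c_1/(2c_2)
\beta_0 = c_0 - c_1^2/(4c_2)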


@Jacob_Moore You’re asking about what’s typically called the “vertex form” of the equation for a parabola. @jsocolar also offers the “standard form” that looks more like a typical regression equation. Be careful not to treat \beta_0 and c_0 as equivalent intercepts; you will have to think differently about your priors depending on the form you choose. The two forms are, however, mathematically equivalent. It follows that you can use the joint posterior samples of c_0, c_1, and c_2 from a traditional regression-style model to calculate the implied posterior distribution of the optimum class size (\alpha). Just make sure that you do the calculation on each joint sample before taking any summary statistics, because of Jensen’s inequality.
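
For example, here is a minimal sketch of that calculation (hypothetical names; assumes a standard-form fit, and note that brms sanitizes the I(x^2) coefficient name, so check variables() on your own fit):

fit <- brm(y ~ 1 + x + I(x^2), data = dat)
draws <- as_draws_df(fit)
# alpha = -c1/(2 c2), computed on each joint draw *before* summarizing
alpha_draws <- -draws$b_x / (2 * draws$b_IxE2)
quantile(alpha_draws, c(0.025, 0.5, 0.975))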

Although the two forms are mathematically the same, a model can sample slowly or poorly in one form yet fit quickly and smoothly in the other. Re-parameterizing to a mathematically equivalent form is a useful trick to keep in your practical toolbox.



Is there a benefit to formulating it the standard way? Perhaps these variants behave like centered/non-centered parameterizations, affecting sampling.

In most applications that I envision, the value of \alpha will be positive but could fall anywhere from 1 to 500, so I’d need a very loose prior indeed. I wonder whether choosing priors for c_0, c_1, and c_2 in standard form is easier; for example, N(0,1) might be reasonable for all of them. Any thoughts here?

Naively, I expect that the standard form would be easier for the sampler than the vertex form, though that isn’t really based on principles. Let’s code up an example…

library(brms)

# set parameters
b0 <- 20
bx <- 5
bx2 <- -0.1
sigma <- 5

# simulate data
set.seed(2394)
students <- 1:50
ey <- b0 + bx * students + bx2 * students^2
y <- rnorm(length(ey), mean = ey, sd = sigma)  # rnorm is vectorized over mean
dat <- data.frame(students = students, y = y)

plot(y ~ students, data = dat)

# fit standard quadratic
mod_std <- brm(y ~ 1 + students + I(students^2), data = dat)

# fit latent (vertex-form) quadratic; bounds keep b1 negative (concave)
# and alpha positive
mod_lat <- brm(bf(y ~ b1 * (students - alpha)^2 + b0,
                  b1 + b0 + alpha ~ 1,
                  nl = TRUE),
               prior = prior(normal(10, 5), nlpar = "b0") +
                 prior(normal(0, 1), nlpar = "b1", ub = 0) +
                 prior(normal(25, 10), nlpar = "alpha", lb = 0),
               data = dat)

pairs(mod_std)

pairs(mod_lat)

I get divergences for the vertex form, but not the standard form. I didn’t really try to make the priors identical, but the funnel shapes in the vertex-form pairs plot suggest this is a more fraught way to go than the standard form. Add the caveat that brms is heavily optimized for linear models and merely flexible for non-linear ones; there may be Stan tricks for the vertex form that aren’t being exploited here.
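
For reference, divergent transitions can also be counted directly (assuming the fits above):

# tally divergent transitions; nuts_params() is re-exported by brms
np <- nuts_params(mod_lat)
sum(np$Value[np$Parameter == "divergent__"])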

Prior predictive checks will be your friend. Once you set your priors, sample from them without conditioning on the data (via brm(..., sample_prior = "only")). You can then compute the distribution of \alpha implied by the priors and check that it is concordant with your prior information.
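
Here is a minimal sketch of that workflow, reusing the standard-form model and dat from above (coefficient names may differ; check variables() on your fit):

# prior predictive check: what do the priors imply about the optimum?
prior_fit <- brm(y ~ 1 + students + I(students^2), data = dat,
                 prior = prior(normal(0, 1), class = "b"),
                 sample_prior = "only")
draws <- as_draws_df(prior_fit)
alpha_prior <- -draws$b_students / (2 * draws$b_IstudentsE2)
hist(alpha_prior)  # compare against what you know about plausible class sizes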


Standard normals on c would not put any appreciable mass on values of \alpha approaching 500. If you know that \alpha is constrained positive, this is potentially a good reason to parameterize with the vertex form, since you would be able to set that constraint/prior.
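
A quick simulation (illustrative, not from the thread) makes this concrete:

# implied optimum under independent standard normal priors on c1 and c2
set.seed(1)
c1 <- rnorm(1e5)
c2 <- rnorm(1e5)
alpha_implied <- -c1 / (2 * c2)  # a heavy-tailed ratio, concentrated near zero
mean(abs(alpha_implied) > 500)   # only a tiny fraction of prior mass out this far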

It’s certainly true that for the parameter values and datasets that @wpetry is exploring, the traditional parameterization seems to avoid problems, though I note that the problems in the vertex parameterization can be substantially lessened (though not completely avoided) by standardizing the covariate and the response.
