Group size parameters in ridge regression with mean embedding kernel

I discussed this model with Erik and Ben about a year ago and they said “Why would you want to do this?” That may be your response after reading this, but bear with me.

I wrote the stan code below to solve the alpha parameters for logistic kernel ridge regression. The kernel K is an P x P matrix of similarities of N different observations. The difference between this and typical kernel regression is that the similarity measure for each N observation is a multivariate euclidean distance of features transformed to a similarity with an RBF. The point of which is to allow each observation to be represented by any number of features and a varying number of measurements (repeated across space, not time). This is generally classified as Distribution Regression, see this paper. I discuss a bit more of the specifics of my case in this Cross-Validated question

The point of this question is to figure out how to include the quantity of measurements, call it eta, per each N observations as a parameter in the model. The code below treats each element of the matrix K with the same weight. What I hope to do is add a parameter that partially pools the alpha weights based on the magnitude of eta. A very similar idea is covered in Bayesian Approaches to Distribution Regression and PDF -
Who Supported Obama in 2012?: Ecological Inference through Distribution Regression

Stan code:

data {
  int<lower=0> N; # number of observations/groups
  int<lower=0> P; # dimension of similarity matrix K
  matrix[P, P] K;    # similarity matrix of observations/groups
  vector[N]    y;    # binary target/response variable (presence/absence)
  real         lambda; # ridge penalty (could be learned)
transformed data{
  matrix[N, N] m;
  matrix[N, N] ident_N;
  ident_N = diag_matrix(rep_vector(1, rows(K)));
  m = K + lambda * ident_N;
parameters {
  vector[N] alpha_hat;
transformed parameters{
  vector[N] q;
  vector[N] e_;
  vector[N] diagW;
  vector[N] pi_;
  vector[N] spec;
  vector[N] Kalpha;
  vector[N] alpha;
  for (n in 1:N) {
    alpha[n] = 1.0 / N;
  Kalpha = K * alpha;
  spec = 1 + exp(-Kalpha);
  for (n in 1:N){
    pi_[n] = 1.0 / spec[n];
  for (n in 1:N){
    diagW[n] = pi_[n] * (1.0 - pi_[n]);
  for (n in 1:N){
    e_[n] = (y[n] - pi_[n]) / diagW[n];
  for (n in 1:N){
    q[n] = Kalpha[n] + e_[n];
model {
  alpha_hat ~ normal(0,10);
  q ~ normal(m * alpha_hat, 0.01);
generated quantities{
  vector[N] yhat1;
  vector[N] yhat2;
  yhat1 =  (1.0 + exp(-(K * alpha_hat)));
  for (n in 1:N){
    yhat2[n] = 1.0 / yhat1[n];

R Code:

K = matrix(c(0.343, 0.288, 0.235, 0.225, 0.314, 0.246, 0.240, 0.246, 0.269, 0.344,
             0.288, 0.404, 0.234, 0.203, 0.341, 0.320, 0.316, 0.300, 0.333, 0.328,
             0.235, 0.234, 0.225, 0.186, 0.233, 0.176, 0.182, 0.172, 0.179, 0.272,
             0.225, 0.203, 0.186, 0.252, 0.260, 0.239, 0.236, 0.288, 0.261, 0.257,
             0.314, 0.341, 0.233, 0.260, 0.466, 0.409, 0.390, 0.476, 0.460, 0.436,
             0.246, 0.320, 0.176, 0.239, 0.409, 0.509, 0.489, 0.562, 0.544, 0.320,
             0.240, 0.316, 0.182, 0.236, 0.390, 0.489, 0.486, 0.537, 0.518, 0.309,
             0.246, 0.300, 0.172, 0.288, 0.476, 0.562, 0.537, 0.715, 0.625, 0.362,
             0.269, 0.333, 0.179, 0.261, 0.460, 0.544, 0.518, 0.625, 0.599, 0.365,
             0.344, 0.328, 0.272, 0.257, 0.436, 0.320, 0.309, 0.362, 0.365, 0.481),
       nrow = 10)
y = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0)
P = nrow(K)
N = length(y)
eta = c(5,10,3,26,12,7,6,12,3,8) # what to do with this?
model_dat <- list(K = K, y = y, P = P, N = N, lambda = 0.1)
fit_normal <- stan(file =  "LMERR_stan.stan", data = model_dat)

If I am crazy, please set me straight. If you have a better idea for binary regression that allows for each observation to have multiple measurements spatially distributed around each ‘site’, please comment! If you have ideas for how to learn the ridge penalty lambda, I am all ears! Any help is very much appreciated.

PS - this set-up looks a bit like a situation for multi-level model, but I do not believe that is the case because each observation/group is unique and I want to predict this on many groups that are not see in training. Then again, I could be misguided in that thinking!

I’ll defer to @bgoodri on the stats (I’m not familiar with this kind of model), but I can make some suggestions for the code. This

vector[N] yhat1;
  vector[N] yhat2;
  yhat1 =  (1.0 + exp(-(K * alpha_hat)));
  for (n in 1:N){
    yhat2[n] = 1.0 / yhat1[n];

can be

vector[N] yhat2 = inv_logit(K * alpha_hat);
vector[N] yhat1 = inv(yhat2);

And this

for (n in 1:N){
    diagW[n] = pi_[n] * (1.0 - pi_[n]);

can just be rolled into the definition as

vector[N] pi_ = diagW = pi_ * (1 - pi_);

The only one that can’t be vectorized is the 1.0 / N case, but that one should be defined in the transformed data block, as it doesn’t involve parameters.


Please define “better”. I have another way, use a beta-binomial to reflect the number of measurements
in binomial (or bernoulli) sense and Kernel/shrinkage to reflect your co-linear dependencies.

Logistic regression I don’t see you use a inv_logit, nor softmax function. I admit I not understood your code
very well, and reading 45 pg of PDF is over my time budget.

1 Like

@andre.pfeuffer thank you for your idea, I will think more deeply on that approach. Also, I provided the PDF links for context; I did not want to presuppose much when asking a question to the world’s experts on stan. If you or anyone find it of interest, I wanted to quickly explain why I am modeling from a similarity matrix and not observations and counts. Perhaps that will clear the code up a tad. (Also see @Bob_Carpenter’s helpful notes above e.i. inv_logit())

Ultimately, the goal is to model the probability of belonging to one of two classes based on the similarity of individuals projected up to the groups they belong to. Modeling individuals directly is avoided because they are highly correlated and the groups are the unit of interest. I’ll call the groups ‘patches’ since they are natural/environmental and each is unique and non-overlapping. In the PDFs this is discussed as a “two-stage” sampling approach; 1) individual observations sampled from a patch; and 2) patches sampled from a population where each patch is in one of two classes; y = {0,1}. Each patch is represented by individual observations x_i^j and class label y_i; together that compose patches z_i to z_n. In my case, each x_i^j is a n x p matrix of individual observations and covariates. The number of covariates p is the same across all z_i, but the number of observations n in each x_i^j can vary. Finding a way to capture the variance of n for each z is the basis for this entire thread.

Two stage sampling:

In order to project the individual observations of each x_i^j (sampling stage 1) to the label of y_i for each z_i we need a function phi. In my case I use an RBF kernel on the element-wise distance of each x_i^j and x_i^j^{\prime} [x and x^{prime} from here out]. Finally, I take the mean of each phi<x\dotx^{\prime} to equal \hat{mu_{x,x^{\prime}}. These products compose matrix K and are the data being modeled.

The matrix K (presented in my original post) is an patch-wise (z,z^{\prime}) matrix of each \hat{mu_{x,x^{\prime}}. Plainly, each element of the matrix K is the patch-wise average similarity of all the individuals within each patch. This matrix and the patch labels y_i are the primary data to be modeled. So the simple model is to find a function mapping \hat{mu} to y.


A method proposed to do so is use Kernel Ridge Regression by minimizing the squared loss of the function to the label y with a penalty term \lambda. The code I posted in my original question solves for \hat{f} with a fixed lambda and generates \hat{y} with the inv_logit() (as per Bob’s comment).


and analytically calculating predictions. [not in the code I posted above]


I apologize if this further muddies the water. I do not expect anyone to read 45 pages of PDF to help me, but any tips are very much appreciated.

Take a look at about how to incorporate your eta. The only hard part of the blog is
that I don’t understand baseball.

About learning lambda, as you implemented it, you have to do a leave-one-out crossvalidation
and calculate the responses each leave-out. You could also try Aki’s regulized horseshoe ,
this would give you sparsity. Since you go by ridge regression likely not what you want.

I see. That’s a typical setup in survey sampling, so you may want to look into that literature. I’m not an expert.

I wrote a very similar case study for Stan using baseball data. But any repeated binary trial data looks the same.