Linear Regression divergence when increasing number of regressors

Hello there,

I need your help guys :S. I am working on a pretty simple linear model for the real and imaginary components of a recorded sound pressure \tilde{p} = \hat{p} + n, where n is the noise. The model is a truncated sum over a spherical harmonics basis:

\mathbb{Re}(\hat{p}) = \sum_{m=0}^{M} \left( A_{m,r} K_m + A_{m,i} C_m \right) \\ \mathbb{Im}(\hat{p}) = \sum_{m=0}^{M} \left( A_{m,i} K_m - A_{m,r} C_m \right)

where K and C are given constants. I want to find both A_{m,r} and A_{m,i} up to some truncation order M. I model the noise as n \sim \mathcal{N}(0,1), and the A's as independent with A \sim \mathcal{N}(0,4). Here is the Stan code:

data {
  int<lower=0> N_meas;                     // Total number of independent measurements
  int<lower=0> M;                          // Number of spherical harmonics modes
  vector[N_meas] d;                        // Distance loudspeaker-microphone
  vector[N_meas] pt_real;                  // Measured pressure at receiver, real part
  vector[N_meas] pt_imag;                  // Measured pressure at receiver, imaginary part
  vector[N_meas] legendre[M];              // Legendre polynomials
  vector[N_meas] bessel[M];                // Spherical Bessel functions
  vector[N_meas] neumann[M];               // Spherical Neumann functions
  real<lower=0> e_real;                    // Noise scale (std. dev.), real part
  real<lower=0> e_imag;                    // Noise scale (std. dev.), imaginary part
  real<lower=0> a_real;                    // Prior scale (std. dev.) of the A's, real part
  real<lower=0> a_imag;                    // Prior scale (std. dev.) of the A's, imaginary part
}
parameters {
  vector[M] Ar;                            // real part of the source strength
  vector[M] Ai;                            // imaginary part of the source strength  
}
transformed parameters{
  vector[N_meas] mu_real;                  // mean of the real part of the pressure
  vector[N_meas] mu_imag;                  // mean of the imaginary part of the pressure
  // Loop over modes 
  mu_real = rep_vector(0, N_meas);
  mu_imag = rep_vector(0, N_meas);
  for (m in 1:M){
    mu_real += d .* (Ar[m] * bessel[m] + Ai[m] * neumann[m]) .* legendre[m];
    mu_imag += d .* (Ai[m] * bessel[m] - Ar[m] * neumann[m]) .* legendre[m];
  }
}
model {
  Ar ~ normal(0, a_real);
  Ai ~ normal(0, a_imag);
  pt_real ~ normal(mu_real, e_real);
  pt_imag ~ normal(mu_imag, e_imag);
}

This problem is solved for each frequency independently, and the data has very little error (it was recorded in anechoic conditions). For every frequency I run into the same problem: up to a certain M the solution converges nicely, but for higher M it diverges (the M where it breaks down depends on the frequency). It seems to have trouble with the many parameters that should shrink towards 0. The amount of data is around 100-200 observations and the number of modes M never goes higher than 10, so the problem is quite overdetermined.

I have solved the posterior analytically and everything works flawlessly, but it would be great to have it working in Stan so I can change the prior, add hyperparameters, etc. The Stan output matches the analytic posterior until M is high enough that it diverges.
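In case it helps the discussion, this is roughly what the analytic solution amounts to: the model is linear in the A's with a Gaussian prior and Gaussian noise, so the posterior is the standard conjugate Gaussian. A minimal Python sketch (the design-matrix construction is only a placeholder for however the K, C weights are stacked, and it assumes e_real == e_imag for simplicity):

import numpy as np

def analytic_posterior(X, y, prior_sd, noise_sd):
    # Conjugate posterior for y ~ N(X a, noise_sd^2 I), a ~ N(0, prior_sd^2 I).
    # Returns the posterior mean and covariance of the coefficient vector a.
    n_par = X.shape[1]
    prior_prec = np.eye(n_par) / prior_sd**2           # prior precision
    post_prec = prior_prec + (X.T @ X) / noise_sd**2   # posterior precision
    post_cov = np.linalg.inv(post_prec)
    post_mean = post_cov @ (X.T @ y) / noise_sd**2
    return post_mean, post_cov

# X stacks the real and imaginary equations on top of each other: each of the
# 2*M columns is one of the d .* bessel .* legendre / d .* neumann .* legendre
# regressors, the coefficient vector is [Ar; Ai], and y stacks pt_real and pt_imag.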

I have uploaded the model and the data for a single frequency (100 Hz). For this case it converges up to M=5.

Thanks!

sh_simple.stan (1.8 KB)
data.csv (8.7 KB)

Is it a non/weak-identification issue? You only have a fixed number of observations on p, but you increase your parameter space on A arbitrarily.

That data.csv file is kinda malformed; can you upload the list of data you pass to Stan as an .rds file (if you use R)? Looking at the file more closely, it looks like you are using Python, so that might not be possible. Maybe stack your data objects of length N_meas into a matrix before trying to format everything as a csv file?
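Alternatively, since you are in Python, dumping everything into a single JSON file might be easier than hand-formatting a csv, and it is easy to load back into R with jsonlite. A rough sketch (I don't use Python much, so the array names, placeholder scales, and file name are only illustrative):

import json
import numpy as np

# Assumed shapes: legendre, bessel, neumann are (M, N_meas) arrays;
# d, pt_real, pt_imag are length-N_meas vectors.
stan_data = {
    "N_meas": int(len(d)),
    "M": int(legendre.shape[0]),
    "d": d.tolist(),
    "pt_real": pt_real.tolist(),
    "pt_imag": pt_imag.tolist(),
    "legendre": legendre.tolist(),
    "bessel": bessel.tolist(),
    "neumann": neumann.tolist(),
    "e_real": 1.0, "e_imag": 1.0,   # placeholder noise scales
    "a_real": 2.0, "a_imag": 2.0,   # placeholder prior scales
}

with open("sh_data.json", "w") as f:   # hypothetical file name
    json.dump(stan_data, f)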

I would also suspect this given the described behaviour.

Hi,

@hhau I will try to put the data properly in R format (I use python…).

Thanks for the fast reply. Maybe it is a weak-identification issue; I am not very familiar with that, to be honest. But maybe I can explain a little bit more about the regression I am using. Spherical harmonics are an exact mathematical solution for a spherical sound field, which is what we have in the data. The different m components are orthogonal, which means that whether I use M=1 or M=5, an individual mode of the basis (e.g. m=1) should always give the same result (mode 1 always explains the same contribution to \tilde{p}). The following plot shows this very well. These are the MAP estimates for the real part of A when using M=3 and M=7, both for the analytical posterior and the Stan optimizer.

[Figure: MAP estimates of the real part of A for M=3 and M=7, analytical posterior vs. Stan optimizer]

Every solution is the same except when Stan gets M=7. And when I go higher (M=10) it cannot even start the optimization ("Line search failed to achieve a sufficient decrease, no more progress can be made").

The same happens with the sampling: below is a comparison between samples drawn from the analytical posterior and samples drawn with Stan.

[Figure: histograms of samples from the analytical posterior vs. samples drawn with Stan]

Because the data is so good (very little noise) and the problem is so overdetermined (180 observations for fewer than 20 parameters), it is hard to understand what is going on. The regularization barely influences the likelihood, since the prior is quite non-informative, and the inversion of the matrix formed by the weights K, C gives very good results (as you can see in the analytical solution).
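I guess one way to check the weak-identification suspicion would be to look at the conditioning of the full design matrix as M grows. A rough sketch of what I have in mind (design_matrix(m_max) is just a placeholder for however I stack the weight columns, not actual code from my script):

import numpy as np

def diagnose_design(X):
    # Condition number and strongest pairwise column correlation of a design matrix.
    cond = np.linalg.cond(X)
    corr = np.corrcoef(X, rowvar=False)
    np.fill_diagonal(corr, 0.0)
    worst = np.unravel_index(np.argmax(np.abs(corr)), corr.shape)
    return cond, worst, corr[worst]

# for m_max in range(1, 11):
#     X = design_matrix(m_max)   # hypothetical helper that stacks the columns
#     print(m_max, diagnose_design(X))

If the condition number explodes, or two columns become almost perfectly correlated, around the M where Stan starts to struggle, that would point at numerical non-identifiability rather than a sampler problem.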

I hope this helps to have a better discussion of the problem.

@hhau I put the data in an R file. It was done with copy/paste brute force (I felt very stupid doing it, but it works). If you run the code it gives back a list called ‘data’ with everything up to M=7. From there I guess you can play around with M. Thanks!
data_r.R (75.2 KB)

I’m sorry you had to do that! Brute forcing things is never nice.

Some immediate standouts from plotting some of the data / the polynomial bases.

  • Some of the early values in neumann are hugely negative and could be a source of numerical instability. In fact, I think this is likely to blame; setting init = "0" in my call to rstan::sampling made the sampler behave a lot better. -2 * -2876372 is quite a big number for data on the unit scale! (See the rescaling sketch after this list.)
  • Are the entries in legendre in the correct order? (They probably are, I’ve never worked with Legendre polynomials before, but plotting their values against d made for a very strange plot.)
  • The system may be less over-determined than you think, as the polynomial basis seems to match the data quite well, and at some point adding more terms introduces identifiability issues (perhaps around the M = 7 mark).
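The rescaling sketch mentioned in the first bullet: one cheap thing to try is dividing each basis vector by its own scale before passing it to Stan, and undoing the scaling on the fitted A's afterwards. Purely a sketch, assuming the bases are stored as (M, N_meas) arrays on the Python side:

import numpy as np

def rescale_basis(basis):
    # Divide each mode's basis vector by its largest absolute value.
    # Returns the rescaled basis and the per-mode scale factors, so the fitted
    # coefficients can be mapped back via A_original = A_rescaled / scale.
    scale = np.max(np.abs(basis), axis=1, keepdims=True)
    return basis / scale, scale.ravel()

# neumann_scaled, neumann_scale = rescale_basis(neumann)
# bessel_scaled, bessel_scale = rescale_basis(bessel)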

The other thing I would suggest you do is add a generated quantities section like this:

generated quantities {
  vector [N_meas] pt_real_ppd;
  vector [N_meas] pt_imag_ppd;

  for (ii in 1:N_meas) {
    pt_real_ppd[ii] = normal_rng(mu_real[ii], e_real);
    pt_imag_ppd[ii] = normal_rng(mu_imag[ii], e_imag);
  }
}

So that you can check the posterior predictive distribution as you increase M to see if some tension arises as a result of said increase.
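Since you are working from Python, once you have the draws, checking posterior predictive coverage is only a few lines. A rough sketch (assuming ppd_draws is an (n_draws, N_meas) array of pt_real_ppd draws extracted from the fit; how you extract it depends on your interface):

import numpy as np

def ppd_coverage(ppd_draws, observed, level=0.9):
    # Fraction of observations falling inside the central posterior predictive interval.
    lo = np.quantile(ppd_draws, (1 - level) / 2, axis=0)
    hi = np.quantile(ppd_draws, 1 - (1 - level) / 2, axis=0)
    return np.mean((observed >= lo) & (observed <= hi))

# coverage = ppd_coverage(ppd_draws, pt_real)
# A coverage that drops sharply as M grows would indicate tension between
# the extra modes and the data.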

I got this model running with M = 7, using the following call to the sampler:

n_chains <- 5
stan_fit <- sampling(
  object = mod_obj,
  data = data,
  chains = n_chains,
  iter = 800,
  warmup = 750,
  cores = n_chains,
  refresh = 20,
  init = "0",
  control = list(
    adapt_delta = 0.99,
    max_treedepth = 12
  )
)

And it produced this pairs plot for the posterior (beware, 15 MB PDF). As you can see, the posteriors for Ar[6] & Ar[5] and Ai[6] & Ai[5] are nearly perfectly collinear. This suggests the model has an identifiability issue for these larger values of M.
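To put a number on that collinearity, you could also compute the posterior correlations of the A's directly from the draws. A quick sketch (assuming Ar_draws is an (n_draws, M) array of Ar draws):

import numpy as np

def posterior_correlations(draws):
    # Pairwise posterior correlations between coefficients (columns of draws).
    return np.corrcoef(draws, rowvar=False)

# corr = posterior_correlations(Ar_draws)
# print(corr[4, 5])   # correlation between Ar[5] and Ar[6] (0-based indexing)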

Hope some of this is useful to you / helps your thought process.

Thanks! It is funny that I was running almost the same configuration for the sampler call, except for the initialization to 0. It makes quite a difference.

  • The Legendre polynomials are a function of the angle between source and receiver, so I can imagine that plotting them against d, which is the radial distance, makes the plots look weird. I didn’t include the angle because it is not needed for the sampling.
  • I am always confused by identifiability. My understanding is that extra parameters will start fitting the noise. But because the basis is orthogonal, I also expect the parameters which capture the actual signal (rather than the noise) to always capture it.
  • Very good idea to calculate the posterior predictive.
  • The Neumann functions tend to minus infinity close to the source, and I am including measurements next to it. When measuring further away everything behaves a bit better, but at some point it still collapses (see the sketch right after this list for a quick check of their magnitude).
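As a quick check on that last point, I can look at how large the Neumann terms get at the closest measurement positions; a rough sketch (the wavenumber and distances below are just how I would form the arguments, not values from the uploaded data):

import numpy as np
from scipy.special import spherical_yn

k = 2 * np.pi * 100 / 343.0   # wavenumber at 100 Hz, assuming c = 343 m/s
d_min = np.min(d)             # closest source-receiver distance
for m in range(11):
    # Magnitude of the spherical Neumann function y_m(k * d_min) for each mode.
    print(m, spherical_yn(m, k * d_min))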

Thanks a lot again! I will keep on working on it.
