Predictive projection for linear model with student-t noise: clarifying question(s)

Marty · May 31, 2022, 11:29pm

Yes. Ken French’s data library contains data sets that are widely used in the finance literature and are well documented. In most cases I use these for simulation and testing before moving to the problem-specific data. I have provided some R code at the end to quickly download one of the sets and get a sense of its shape. If these data are not large or heavy-tailed enough, I can provide more (just not through a public link).

Will do! Thank you for sharing those papers. I will try the analytic solution if optimisation is too slow or unstable.

OK, thanks. However, I don’t understand why we still need to perform projection after selection. I may have misunderstood the workflow: I thought that once we have selected the submodel variables for each size using projection, we fit each submodel to the observed y and select the final model size based on LOO or k-fold CV.

Thanks again for all the help!

Here is R code to quickly look at some of the open returns data mentioned earlier.

##--- Download file, unzip
f <- "https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/49_Industry_Portfolios_CSV.zip"
dest_file <- "49_Industry_Portfolios_CSV.zip"  # monthly file, despite no "monthly" in the name
download.file(url = f, destfile = dest_file, mode = "wb")
unzip(dest_file)
file.remove(dest_file)

##--- Read and process
d <- readr::read_csv("49_Industry_Portfolios.CSV", 
                     col_types = paste0(rep("n", 50), collapse = ""), 
                     skip=11)
# Two sets of data are stacked, retrieve only first
ix <- which(is.na(d[, 1]))[1]
d <- d[1:(ix - 1), ]
# Remove NAs (coded as -99.99 or -999)
d[d < -99] <- NA
d <- tidyr::drop_na(d)
# If you want to keep date column, process using lubridate, 
# otherwise just remove it. Will be slightly different for daily data. 
library(lubridate)
names(d)[1] <- "date"
d$date <- lubridate::ymd(d$date * 100 + 1)
# Move each date to the last day of its month
d$date <- lubridate::ceiling_date(d$date, "month") - lubridate::days(1)

##--- Plot histograms and compare to Gaussian
for (i in 2:ncol(d)) {
  yi <- d[[i]]
  hist(yi, main=colnames(d)[i], probability = TRUE)
  xi <- seq(min(yi)*1.1, max(yi)*1.1, length.out = 500)
  dn <- dnorm(xi, mean(yi), sd(yi))
  lines(xi, dn, col='blue')
}
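
As a numeric complement to the histograms, sample excess kurtosis gives a rough indication of tail weight. The following is only a sketch on simulated data (the helper `excess_kurtosis` and the simulated vectors are illustrative assumptions, not part of the data library); the same function could be applied to each column `d[[i]]`.

```r
# Sample excess kurtosis: roughly 0 for Gaussian data, positive for heavier tails.
excess_kurtosis <- function(x) {
  z <- (x - mean(x)) / sd(x)  # standardise
  mean(z^4) - 3               # fourth standardised moment minus the Gaussian value
}

set.seed(1)
n <- 5000
y_gauss <- rnorm(n)       # Gaussian benchmark
y_t <- rt(n, df = 5)      # Student-t with 5 degrees of freedom (heavier tails)

k_gauss <- excess_kurtosis(y_gauss)
k_t <- excess_kurtosis(y_t)
print(c(gaussian = k_gauss, student_t = k_t))
```

Sample kurtosis is itself noisy for heavy-tailed data, so it is best read alongside the plots rather than as a test.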

avehtari · June 1, 2022, 8:24am

What you describe corresponds to Section 3.3.1 in A survey of Bayesian predictive methods for model assessment, selection and comparison, and the decision-theoretically better way to do the inference after selection is described in Section 3.3.2.
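
To make the difference concrete, here is a minimal sketch of draw-by-draw projection onto a submodel. It assumes a Gaussian observation model, where the projection has a closed form (least squares on the reference model's fitted values, plus an inflated noise scale); with Student-t noise, as in this thread, the projection would generally need numerical optimisation instead. The "reference draws" below are a hypothetical stand-in built by perturbing an OLS fit, and all variable names are illustrative.

```r
set.seed(1)
n <- 200; p <- 5; S <- 100
X <- cbind(1, matrix(rnorm(n * p), n, p))   # intercept + p predictors
beta_true <- c(0.5, 2, -1, 0, 0, 0)
y <- as.vector(X %*% beta_true + rnorm(n))

# Hypothetical stand-in for S posterior draws from a reference model
beta_hat <- qr.solve(X, y)                                   # OLS estimate
beta_ref <- beta_hat + matrix(rnorm((p + 1) * S, sd = 0.05), p + 1, S)
sigma_ref <- rep(1, S)
mu_ref <- X %*% beta_ref                                     # n x S fitted means

# Submodel: intercept and the first two predictors
X_sub <- X[, 1:3]

# Projection: regress the reference fit mu_ref (not y) on the submodel design
beta_proj <- qr.solve(X_sub, mu_ref)                         # 3 x S
resid_proj <- mu_ref - X_sub %*% beta_proj

# Projected noise scale absorbs what the submodel cannot explain
sigma_proj <- sqrt(sigma_ref^2 + colMeans(resid_proj^2))
```

Refitting the submodel to `y` would instead use `qr.solve(X_sub, y)`, which discards the information carried by the reference model.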

Thanks!

Marty · June 1, 2022, 9:53am

Ahhh I see! I feel like my brain is going to explode! So the projpred method you apply in “Using reference models in variable selection” and “Projective inference in high dim problems” corresponds to Section 3.3.2 of the survey paper?

I will be using the clustered projection approach, so is PSIS-LOO still a reasonable choice if I use only ~10 clusters?

Thank you very much for all clarification. This is extremely helpful.

avehtari · June 1, 2022, 3:17pm

Yes, with additional details in Section 5.4 and specifically the unnumbered subsection Parametric projections and following Goutis and Robert (1998) and Dupuis and Robert (1997, 2003) and Eq (132).

Unlikely, but not impossible. It’s difficult to diagnose the quality of the approximation with that few

Marty · June 1, 2022, 7:37pm

Great, thanks for the extra information.

OK, understood.

Thank you very much once again!
