Predictive projection for a linear model with Student-t noise: clarifying question(s)

Yes. Ken French’s data library contains data sets that are widely used in the finance literature, and it is well documented. I use it for simulation and testing in most cases before moving to the problem-specific data. I have provided some R code at the end to quickly download one of the sets and get a sense of its shape. If this data is not large enough or thick-tailed enough, I can provide more (just not through a public link).

Will do! Thank you for sharing those papers; I will try the analytic solution if the optimisation is too slow or unstable.

OK, thanks, although I don’t understand why we still need to perform projection after selection. I may have misunderstood the workflow; I thought that once we have selected the submodel variables for each size using projection, we fit each submodel to the observed y and select the final model size based on LOO or k-fold CV.
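In code, the workflow I had in mind is roughly the following. This is only a sketch: the data frame df, the formula, and the response name y are placeholders, and I’m assuming the rstanarm/projpred interface.

library(rstanarm)
library(projpred)

## Fit the reference model to the observed y (placeholder formula/data)
ref_fit <- stan_glm(y ~ ., data = df)

## Use projection to find the best submodel of each size
vs <- varsel(ref_fit, method = "forward")
sol <- solution_terms(vs)

## The step I am unsure about: refit each submodel to the observed y
## and pick the final size by LOO
fits <- lapply(seq_along(sol), function(k) {
  stan_glm(reformulate(sol[1:k], response = "y"), data = df)
})
loos <- lapply(fits, loo)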

Thanks again for all the help!


Here is R code to quickly look at some of the open returns data mentioned earlier.

##--- Download file, unzip
f <- "https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/49_Industry_Portfolios_CSV.zip"
dest_file <- "49_Industry_Portfolios_CSV.zip"
download.file(url = f, destfile = dest_file, mode = "wb")
unzip(dest_file)
file.remove(dest_file)

##--- Read and process
# 50 numeric columns: the date (YYYYMM) plus the 49 industry portfolios;
# skip the descriptive header lines at the top of the file
d <- readr::read_csv("49_Industry_Portfolios.CSV",
                     col_types = paste0(rep("n", 50), collapse = ""),
                     skip = 11)
# Two data sets are stacked in the file; keep only the first,
# which ends at the first NA row in the date column
ix <- which(is.na(d[[1]]))[1]
d <- d[seq_len(ix - 1), ]
# Missing values are coded as -99.99 or -999; recode them to NA and drop
d[d < -99] <- NA
d <- tidyr::drop_na(d)
# If you want to keep the date column, parse it with lubridate;
# otherwise just remove it. (Daily data needs slightly different handling.)
library(lubridate)
names(d)[1] <- "date"
# Dates are YYYYMM integers; convert to the last day of each month
d$date <- ymd(d$date * 100 + 1)
d$date <- ceiling_date(d$date, "month") - days(1)

##--- Plot histograms and compare to Gaussian
for (i in 2:ncol(d)) {
  yi <- d[[i]]
  hist(yi, main = names(d)[i], probability = TRUE)
  # Overlay a Gaussian with matching mean and sd for comparison
  xi <- seq(min(yi) * 1.1, max(yi) * 1.1, length.out = 500)
  lines(xi, dnorm(xi, mean(yi), sd(yi)), col = "blue")
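  # Also overlay a location-scale Student-t fit to show the thick tails.
  # A sketch: assumes MASS::fitdistr's built-in "t" family, which fits
  # location m, scale s, and df by maximum likelihood (it may occasionally
  # fail to converge for some series)
  tf <- MASS::fitdistr(yi, "t")
  m <- tf$estimate["m"]; s <- tf$estimate["s"]; nu <- tf$estimate["df"]
  lines(xi, dt((xi - m) / s, df = nu) / s, col = "red")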
}

What you describe corresponds to Section 3.3.1 of “A survey of Bayesian predictive methods for model assessment, selection and comparison”; the decision-theoretically better way to do the inference after selection is described in Section 3.3.2.
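In projpred terms, the Section 3.3.2 approach means predicting from the projected posterior rather than refitting the submodel to y. Roughly, continuing from a varsel result vs as in your sketch (size and df_new are placeholders):

## Project the reference posterior onto the selected submodel and
## predict from the projection, instead of refitting to y
size <- suggest_size(vs)
proj <- project(vs, nterms = size)
pred <- proj_predict(proj, newdata = df_new)  # df_new is a placeholder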

Thanks!


Ahhh, I see! I feel like my brain is going to explode! So the projpred method you apply in “Using reference models in variable selection” and “Projective inference in high-dimensional problems” corresponds to Section 3.3.2 of the survey paper?

I will be using the clustered projection approach, so is PSIS-LOO still a reasonable choice if I use only ~10 clusters?
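To be concrete, by the clustered projection approach I mean something like the following (a sketch; the argument names nclusters and nclusters_pred are my understanding of the projpred interface):

## Cluster the posterior draws before projecting, instead of
## projecting every draw separately
vs <- cv_varsel(ref_fit,
                nclusters      = 20,   # clusters used during the search
                nclusters_pred = 10)   # clusters used for the predictions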

Thank you very much for all clarification. This is extremely helpful.

Yes, with additional details in Section 5.4, specifically the unnumbered subsection “Parametric projections”, following Goutis and Robert (1998) and Dupuis and Robert (1997, 2003), and Eq. (132).
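For anyone following along, the draw-by-draw projection in those papers has the form (from memory, not quoting Eq. (132) verbatim): for each draw $\theta^*$ from the reference model posterior, find the submodel parameters

$$
\theta_\perp = \arg\min_{\theta}\; \frac{1}{n}\sum_{i=1}^{n} \mathrm{KL}\big(\, p(\tilde{y}_i \mid \theta^*, M_*) \,\big\|\, p(\tilde{y}_i \mid \theta, M_\perp) \,\big),
$$

i.e. the submodel’s predictive distribution is pushed as close as possible, in KL divergence, to the reference model’s.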

Unlikely, but not impossible. It’s difficult to diagnose the quality of the approximation with that few clusters.


Great, thanks for the extra information.

OK, understood.

Thank you very much once again!