Yes. Ken French’s data library contains data sets widely used in the finance literature, and it is well documented. In most cases I use it for simulation and testing before moving to the problem-specific data. I have provided some R code at the end to quickly download one of the sets and get a sense of its shape. If this data is not large or thick-tailed enough, I can provide more (just not through a public link).
Will do! Thank you for sharing those papers. I will try the analytic solution if optimisation is too slow or unstable.
OK, thanks. I don’t understand, though, why we still need to perform projection after selection. I may have misunderstood the workflow: I thought that once we have selected the submodel variables for each size using projection, we fit each submodel to the observed y and select the final model size based on LOO or k-fold CV.
Thanks again for all the help!
Here is R code to quickly look at some of the open returns data mentioned earlier.
##--- Download file, unzip
f <- "https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/49_Industry_Portfolios_CSV.zip"
dest_file <- "49_Industry_Portfolios_CSV.zip"
download.file(url = f, destfile = dest_file, mode = "wb")
unzip(dest_file)
file.remove(dest_file)
##--- Read and process
d <- readr::read_csv("49_Industry_Portfolios.CSV",
                     col_types = paste0(rep("n", 50), collapse = ""),
                     skip = 11)
# Two sets of data are stacked, retrieve only first
ix <- which(is.na(d[, 1]))[1]
d <- d[1:(ix - 1), ]
# Remove NAs (coded as -99.99 or -999)
d[d < -99] <- NA
d <- tidyr::drop_na(d)
# If you want to keep date column, process using lubridate,
# otherwise just remove it. Will be slightly different for daily data.
library(lubridate)
names(d)[1] <- "date"
d$date <- lubridate::ymd(d$date * 100 + 1)
d$date <- lubridate::ceiling_date(d$date, "month") %m+% -days(1)
##--- Plot histograms and compare to Gaussian
for (i in 2:ncol(d)) {
  yi <- d[[i]]
  hist(yi, main = colnames(d)[i], probability = TRUE)
  # Grid slightly wider than the observed range, for the Gaussian overlay
  xi <- seq(min(yi) * 1.1, max(yi) * 1.1, length.out = 500)
  dn <- dnorm(xi, mean(yi), sd(yi))
  lines(xi, dn, col = "blue")
}
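If a numeric summary is useful alongside the histograms, one rough option is the sample excess kurtosis, which is about 0 for Gaussian data and positive for thicker tails. The sketch below is only an illustration: `excess_kurtosis` is a hypothetical helper (not part of the code above), and the Student-t draws are a simulated stand-in for returns so that the snippet runs without the download.

```r
# Sample excess kurtosis: roughly 0 for Gaussian data, positive for thicker tails.
# Hypothetical helper, introduced here for illustration only.
excess_kurtosis <- function(x) {
  z <- x - mean(x)
  mean(z^4) / mean(z^2)^2 - 3
}

set.seed(1)
gauss <- rnorm(1e5)        # Gaussian reference sample
heavy <- rt(1e5, df = 5)   # simulated thick-tailed stand-in (assumption, not real returns)

excess_kurtosis(gauss)
excess_kurtosis(heavy)
```

With the data frame `d` from above (first column is the date), something like `sort(sapply(d[-1], excess_kurtosis), decreasing = TRUE)` should rank the industries by tail thickness.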