So there are 3 dummies encoding different time periods (first_period, second_period & third_period) and fixed effects (channel_id & week). I was not able to find a stan_plm method so I selected stan_lm:
It seems like the fixed effect-dummies need to much memory (I have about 50 weeks and 5000 channels), that’s why I am not able to run it (Error: cannot allocate vector of size 3.3 Gb). Now I would have some questions:
Are there already prepared models for estimating fixed-effect models like stan_plm, stan_fem or something similar?
Are there alternatives to dummy-encoding or do I have to do a manual demeaning beforehand?
We don’t yet have stan_plm or anything like it in rstanarm, but adding the dummy variables is equivalent to what plm does by default. You can do the demeaning but that results in a posterior distribution that understates you uncertainty due to all those estimated means.
My guess is that it is the QR decomposition that is exhausting the memory rather than the dummy variables themselves. I guess we could do the QR decomposition sparse. If you can estimate it with the biglm function in the biglm R package, there is a stan_biglm function you can use.
Does QR decomposition drain memory - and so shouldn’t be used on a large data-set? Interesting.
How to control for agent (individual-specific) fixed effects then, when one has too many agents? For example, I have a panel of 10,000 firms say over time (T =20). I will run a hierarchical model across 5 time periods (binned time intervals). But also want to control for time-constant, firm-specific, fixed effects. How to do this optimally? Having one dummy for each firm does not make sense.
If you would prefer me to start a new post on this I can. But the topic seems quite similar. Any help would be much appreciated.
Doing a QR decomposition in the Stan language (as opposed to beforehand in R) is memory-intensive because it produces an orthogonal factor that is N \times N rather than N \times K. But we added a qr_thin_Q function a couple versions ago that avoids this problem and basically does what R does, although R can compute it using multiple threads.
I haven’t thought that much about your second question, but I think you would want to do 10,000 thin QR decompositions and then restrict the coefficients on the \mathbf{Q} matrices to be the same or else not too different across firms.
Maybe you can try reformulating you model matrix into a “dense” part (your predictor variables) on which you can use QR and a sparse part (the fixed effects) on which you can use csr_matrix_times_vector. It’s not perfect, but worked for me in a (presumably) similar situation.