Advice for a novice in hierarchical modelling with Stan

Hello everyone. As the title says, I’m quite new to multilevel modeling, and more specifically to applying a Bayesian approach using Stan. I’ve been reading sections of Gelman’s book (2006), internet resources, and case studies from Stan’s website. The aim of this post is, initially, to describe a modeling problem and ask for your opinion, more experienced than mine, on whether what I want to achieve makes sense and whether multilevel modelling with Stan is a valid mechanism for it. If so, I would be very grateful for any additional resources you can recommend, and maybe one day I will come back with a modelling proposal and more concrete questions to share.

First of all, I want to say that the problem is a simple exercise that I have taken as motivation to go deeper into methodologies that are new to me (multilevel causal modeling with Stan). Please take it as such, even if the goal seems unrealistic to you.

Inside a large building dedicated to research there are people-counting sensors at certain locations (cross-sectional data) whose counts are processed at a certain frequency (30 min, thus panel data in the end). There is no person recognition in the counting, so they are just aggregate flows. The aim of the research design is to determine whether a change is observed after the implementation of telework measures and, subsequently, after the return to “normal”. I had therefore thought of a piecewise growth model in which the change points are the specific days on which the measures were implemented, so that the changes can be quantified while taking the between-sensor variability into account (partial pooling). In this way, level 1 would be the daily repeated measures and level 2 the sensors/locations. Does it make sense? The thing is that I haven’t found piecewise growth models to be very common in multilevel/Stan examples.
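To make the piecewise idea concrete, here is a minimal sketch in plain Python of the predictors such a model could use. It only builds the level-1 design columns and fits nothing. The function name `piecewise_basis` and the change days (10 and 40) are hypothetical, chosen purely for illustration. Each change point gets a step column (a sudden level shift) and a ramp column (a change in slope afterwards).

```python
def piecewise_basis(day, change_days):
    """Return [step_1, ramp_1, step_2, ramp_2, ...] for one day index.

    step_k is 1 from the k-th change day onwards (sudden level shift);
    ramp_k is the number of days elapsed since that change (slope change).
    """
    row = []
    for c in change_days:
        step = 1 if day >= c else 0
        ramp = max(0, day - c)
        row.extend([step, ramp])
    return row


# Hypothetical change points: telework starts on day 10, return on day 40.
CHANGE_DAYS = [10, 40]

# One row of predictors per day of a 60-day observation window.
design = [piecewise_basis(d, CHANGE_DAYS) for d in range(60)]
```

In a multilevel version, the coefficients on these columns would be allowed to vary by sensor, which is where the partial pooling comes in.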

On the other hand, the daily measures are a time series with 288 points (5 min). I consider a multivariate approach a bit unfeasible (https://mc-stan.org/docs/2_20/stan-users-guide/multivariate-outcomes.html). However, I would still be interested in detecting change according to the time of day, mostly to check whether the ‘back to normality’ is the same for different timeslots. Therefore, I am thinking of forming groups (say 2 or 3 hours), summarizing, and either (1) checking the between-timeslot variance in a 3-level model, or (2) fitting independent 2-level models for each timeslot, with a single univariate summarized measure or an unsummarized multivariate one. Do you consider one alternative more formal than the other?
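For the summarizing step, a minimal sketch in plain Python, assuming one day of counts arrives as a list of 288 five-minute values. The function name `summarize_day` and the default 2-hour slot are hypothetical choices, and summing is only one possible summary (a mean or a peak would work the same way).

```python
def summarize_day(counts, slot_hours=2, minutes_per_point=5):
    """Sum consecutive counts into fixed-width timeslots.

    With 5-minute points and 2-hour slots, each slot covers 24 points,
    so a 288-point day becomes 12 slot totals.
    """
    points_per_slot = slot_hours * 60 // minutes_per_point
    if len(counts) % points_per_slot != 0:
        raise ValueError("day length is not a whole number of timeslots")
    return [
        sum(counts[i:i + points_per_slot])
        for i in range(0, len(counts), points_per_slot)
    ]
```

Each slot total would then become one observation per sensor, day, and timeslot, which is the data layout that both option (1) and option (2) start from.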

Finally, I know beforehand that there is weekly seasonality since, for example, the flow will be lower on Saturdays and Sundays. However, I am unsure how to control for this effect so that it does not distort the piecewise growth model. Would it be through group-level predictors?

As you can see, some of my doubts are more about modeling than about fitting in Stan, but I think there are people in this forum who are very experienced with this kind of model, and I am also interested in using a Bayesian approach with Stan. My apologies if the forum is exclusively for questions that already include a model written in Stan. Any kind of help or advice will be much appreciated!

This is a relevant Gelblog post: https://statmodeling.stat.columbia.edu/2019/06/30/what-if-the-authors-of-that-regression-discontinuity-paper-had-only-reported-their-local-linear-model-results-with-no-graph/

Not to say the idea is good or bad – it’s just relevant discussion.

For an idea like this I’d try to think of it in terms of the lme4 syntax. Something like:

y ~ saturday + sunday + location

would be a model for the counts, where saturday and sunday are indicators for the day of the week and location identifies the sensor location.
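As a rough illustration of what those indicator columns look like, here is a small Python sketch. The function name `make_row`, the location label, and the example dates are made up for illustration. It relies on `date.weekday()` from the standard library, which returns 5 for Saturday and 6 for Sunday.

```python
from datetime import date


def make_row(day, location):
    """Build the predictors for one observation: weekend indicators plus location."""
    weekday = day.weekday()  # Monday is 0, Saturday is 5, Sunday is 6
    return {
        "saturday": int(weekday == 5),
        "sunday": int(weekday == 6),
        "location": location,
    }


# Hypothetical observations from one sensor over a Saturday, Sunday, and Monday.
rows = [make_row(date(2020, 3, d), "entrance") for d in (14, 15, 16)]
```

Weekdays act as the reference level here, so the saturday and sunday coefficients measure how far weekend flow sits from the weekday baseline.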

You could replace location with (1 | location) to get partial pooling between locations, but it’s not obvious to me in this case you’d want that:

  1. Why would different locations be similar?
  2. If you’re collecting this data every 30 mins, it sounds like you’ll have a lot of it, in which case pooling won’t be necessary

What is the thing you want to detect?


Yes, indeed. Very interesting read. I also follow the blog. And this reinforces my doubts about whether I’m making the right choice. But well, I’m learning, so if I’m wrong I’d like to know too :)

  • They should, because the measures were taken (and released afterwards) for all the areas at the same point in time and applied during the same timespan. So let’s say they should have observed the same ‘relative’ drop (sudden drift) in counts if they have complied with the measures.

  • Yes, but since I’d like to account for different periods of the day, that means one observation per day for each period (summarizing the counts for 8-9, 12-13, 16-17 as peak hours, for example).

I want to assess whether the measures have been followed with the same level of compliance in the different locations and at the different periods of the day. That’s why I thought of treating it as a random effect, to account for the variability. And the piecewise growth was motivated by the expected form, something like a step function.

This seems like some sort of overall multiplicative effect. So I guess if this is modeled on the log scale it’ll be like a group mean?
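A quick arithmetic check of that intuition, as a Python sketch with made-up numbers (the location names, baseline counts, and the 0.3 factor are all hypothetical). If every location falls to the same fraction of its baseline, the absolute drops differ, but the shift in log counts is identical across locations.

```python
import math

# Hypothetical average daily counts per location before the measures.
baseline = {"entrance": 400.0, "lab_wing": 50.0, "cafeteria": 120.0}

# Assumed common multiplicative effect: counts fall to 30% of baseline.
relative_effect = 0.3

after = {loc: n * relative_effect for loc, n in baseline.items()}

# Absolute drops differ a lot between locations...
absolute_drop = {loc: baseline[loc] - after[loc] for loc in baseline}

# ...but on the log scale every location shifts by the same amount.
log_shift = {
    loc: math.log(after[loc]) - math.log(baseline[loc]) for loc in baseline
}
```

Every entry of `log_shift` equals `math.log(relative_effect)` up to floating-point error, so on the log scale a shared relative drop looks like a single common offset, with location-level deviations from it reflecting differences in compliance.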

Also, have you looked at Prophet (“Forecasting at scale”)? This reminds me of that too. There was a presentation at StanCon a couple of years ago: https://www.youtube.com/watch?v=E8z3LObimok . There are probably others as well.

Yep

I was aware of Prophet as a toolkit for time-series modelling that uses Stan under the hood to fit the parameters, but I have never used it. I’ll take a closer look at it. Thanks!

Just to be clear I’m not saying use it necessarily – it can be more fun to make stuff yourself. But I remember thinking the way they did their stuff was pretty neat and different than other things I’d seen.