I am seeking advice about what I should learn in order to do this project. The goal of the project is to understand the linguistic signatures of turnover. I have millions of email messages from four different organizations, where the content has been transformed into a set of roughly 25 linguistic features. I have the age, gender, and organizational membership of each person in the dataset. I also know whether they quit voluntarily, were terminated involuntarily, or were simply right censored.

My current approach is to aggregate the linguistic features within 30-day periods, then use a survival model to estimate the hazard of quitting. Depending on the aggregation functions, aggregating within each month can expand the roughly 25 base features into 700+ features per person-month (comparing incoming messages to outgoing messages, averages / variances across the conversations within the person’s local network, etc).
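For readers following along, the 30-day aggregation described above can be sketched roughly as follows. This is only an illustration under assumptions: the function name `person_period_table`, the tuple layout of `messages`, and the single `feature_value` are hypothetical stand-ins for the real 25 features, and only a mean and a count are computed. Periods in which a person sent no messages produce no row here, which a real pipeline would need to handle.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

WINDOW_DAYS = 30  # the 30-day aggregation period from the post


def person_period_table(messages, start_dates, exits):
    """Build one row per (person, 30-day period) with an event indicator.

    messages:    iterable of (person_id, date, feature_value)  [hypothetical layout]
    start_dates: person_id -> first observed date
    exits:       person_id -> (last observed date, quit_flag); quit_flag is
                 False for right-censored or involuntarily terminated people
    """
    buckets = defaultdict(list)
    for person, day, value in messages:
        period = (day - start_dates[person]).days // WINDOW_DAYS
        buckets[(person, period)].append(value)

    rows = []
    for (person, period), values in sorted(buckets.items()):
        exit_day, quit_flag = exits[person]
        last_period = (exit_day - start_dates[person]).days // WINDOW_DAYS
        rows.append({
            "person": person,
            "period": period,
            "mean_feature": mean(values),
            "n_messages": len(values),
            # The event is 1 only in the period where a voluntary quit occurs.
            "event": int(quit_flag and period == last_period),
        })
    return rows


# Toy data, invented for illustration only.
messages = [
    ("a", date(2020, 1, 5), 0.2),
    ("a", date(2020, 1, 20), 0.4),
    ("a", date(2020, 2, 10), 0.9),
    ("b", date(2020, 1, 3), 0.5),
]
start_dates = {"a": date(2020, 1, 1), "b": date(2020, 1, 1)}
exits = {"a": (date(2020, 2, 15), True), "b": (date(2020, 3, 1), False)}
rows = person_period_table(messages, start_dates, exits)
for row in rows:
    print(row)
```

A table in this person-period shape is what a discrete-time survival model would consume; each extra aggregation function adds another column per base feature, which is how the feature count grows so quickly.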

So far I’ve attempted an elastic net survival model and an xgboost survival model. But I’m curious what a Bayesian approach to this task might be. Where would you start with such a project and what kinds of models would you attempt?

I presume you have the linguistic and demographic features per message and a timestamp for each message? Do you also have to/from? Are you able to infer a messaging network?

In my mind, the purpose of the Bayesian approach is to approximate how the observed data is generated. Everything interesting proceeds from there. If you don’t know anything substantial (theory) about the domain (e.g. why people leave) then I’d start with an exploratory analysis aimed at helping me gain some domain knowledge. I’d complement that by reading a lot on the subject to make my ideas more relevant. I’d re-emphasise the importance of the latter. Very often the best model is supported by the data but is very unlikely to be derived from it, and in that case your inspiration cannot come from the data.

I’d then look from the outset at where my model can be factored (made separable or hierarchical) so that I can focus effort on local pockets, some of which are more likely to be predictable than others.

Finally, I’d model each pocket separately dependent on what I discover, and I’d look for opportunities to pool or merge parameters between pockets where appropriate, eventually culminating in a model of the whole.
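To make "pooling parameters between pockets" concrete, here is a minimal sketch of partial pooling under a normal-normal model, where each pocket's estimate is shrunk toward a shared mean. Everything here is an assumption for illustration: the function name `partial_pool`, the numbers, and the simplification that `sigma`, `mu`, and `tau` are known. In a full Bayesian model the shared mean and spread would be given priors and estimated along with everything else.

```python
def partial_pool(group_means, group_sizes, sigma, mu, tau):
    """Posterior mean of each group's effect under a normal-normal model.

    Assumes (for illustration) known within-group sd `sigma`, known
    population mean `mu`, and known between-group sd `tau`.
    """
    pooled = []
    for ybar, n in zip(group_means, group_sizes):
        data_precision = n / sigma**2   # information from the group's own data
        prior_precision = 1 / tau**2    # information from the population level
        weight = data_precision / (data_precision + prior_precision)
        pooled.append(weight * ybar + (1 - weight) * mu)
    return pooled


# Two hypothetical pockets: one with plenty of data, one with very little.
estimates = partial_pool([0.10, 0.30], [1000, 10], sigma=1.0, mu=0.2, tau=0.1)
print(estimates)
```

The large pocket stays close to its own mean, while the small pocket is pulled most of the way toward the shared mean, which is the behaviour one usually wants when some pockets are data-poor.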

Prima facie, I’d say that whatever your model, it’s more likely to be dynamic (a process). E.g. dynamic survival models, dynamic Poisson, Gaussian process, etc. That way you can take the whole data into account and don’t have to make arbitrary cuts. A big advantage of the Bayesian approach for decision making is that you’re able to quantify uncertainty on the whole. This advantage is cut short if you make arbitrary (or unparameterised) treatments to the data such as cuts, windows, etc.
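As a small illustration of working with exact times rather than binned ones, the log-likelihood of a continuous-time survival model with right censoring can be written directly as the sum over people of (event indicator × log hazard at the exit time) minus the cumulative hazard up to that time. The sketch below assumes a Weibull hazard purely for convenience; that choice, the function name, and the toy data are not from this thread. Treating involuntary terminations as censored for the quit hazard is likewise a modelling assumption.

```python
import math


def weibull_log_likelihood(times, events, shape, scale):
    """Right-censored log-likelihood with a Weibull hazard.

    times:  observed tenure for each person (time of quit or of censoring)
    events: 1 if the person quit at that time, 0 if censored
    """
    total = 0.0
    for t, d in zip(times, events):
        # log h(t) for h(t) = (shape / scale) * (t / scale) ** (shape - 1)
        log_hazard = (
            math.log(shape) - math.log(scale)
            + (shape - 1) * (math.log(t) - math.log(scale))
        )
        # H(t) = (t / scale) ** shape
        cumulative_hazard = (t / scale) ** shape
        # Censored people contribute only the survival term -H(t).
        total += d * log_hazard - cumulative_hazard
    return total


# Toy data, invented for illustration only.
tenure_days = [10.0, 20.0, 30.0]
quit_flags = [1, 0, 1]
ll = weibull_log_likelihood(tenure_days, quit_flags, shape=1.0, scale=20.0)
print(ll)
```

A dynamic model would let the hazard depend on time-varying covariates or latent states instead of a fixed parametric form, but the likelihood keeps this same structure.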

We’ve been emphasizing this generative perspective more and more, especially for scientific models, which are often composed of a mechanistic scientific model coupled with a more statistically oriented measurement model, the goal being to work back to the parameters of the scientific model and make calibrated predictions going forward.

The relation of the data to the model is that we want the model to be able to capture relevant properties of the data, like marginal counts, variances, and means. So we use tests like posterior predictive checks to see if our models actually match our data. Usually the particular data we have is determined by the measurement process. So in that sense, you need to take the source of the data into account in a full Bayesian model.
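A posterior predictive check of the kind mentioned above can be sketched in a few lines. This is a toy example under stated assumptions: the counts in `y` are invented, the model is a plain Poisson with a conjugate Gamma prior (chosen only so the posterior can be sampled without an MCMC library), and the test statistic is the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed counts, e.g. messages sent per period by one person.
y = np.array([0, 0, 1, 0, 25, 0, 2, 30, 0, 1, 0, 28])

# Conjugate Gamma(a, b) prior on the Poisson rate (b is a rate parameter),
# so the posterior is Gamma(a + sum(y), b + n).
a, b = 1.0, 1.0
n_draws = 4000
rate_draws = rng.gamma(shape=a + y.sum(), scale=1.0 / (b + len(y)), size=n_draws)

# One replicated dataset per posterior draw, same size as the observed data.
y_rep = rng.poisson(lam=rate_draws[:, None], size=(n_draws, len(y)))

# Test statistic: the variance, which a Poisson model ties to the mean.
obs_var = y.var()
rep_var = y_rep.var(axis=1)

# Fraction of replicated datasets at least as variable as the observed one.
p_value = float(np.mean(rep_var >= obs_var))
print(f"observed variance {obs_var:.1f}, posterior predictive p-value {p_value:.3f}")
```

A p-value near 0 or 1 means the replicated data rarely look like the observed data on that statistic. Here the bursty counts are far more variable than a Poisson model can produce, which would point toward an overdispersed alternative.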

The roadblock we run into when we try to just scale everything up to a full Gaussian process is computation. So we often have to make pragmatic tradeoffs based on what’s computable as well as what we want our models to be.