I am seeking advice about what I should learn in order to do this project. The goal of the project is to understand the linguistic signatures of turnover. I have millions of email messages from four different organizations, where the content has been transformed into a set of roughly 25 linguistic features. I have the age, gender, and organizational membership of each person in the dataset. I also know whether each person quit voluntarily, was terminated involuntarily, or was simply right-censored.
My current approach is to aggregate the linguistic features within 30-day periods, then use a survival model to estimate the hazard of quitting. Depending on the aggregation functions, aggregating within each person-month can expand the roughly 25 base features into 700+ features per person-month (comparing incoming messages to outgoing messages, averages and variances within the conversations in the person’s local network, etc.).
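To make the aggregation step concrete, here is a minimal sketch of the kind of 30-day roll-up I mean. The table layout is an assumption for illustration only: the column names `person_id`, `sent_at`, and the `feat_` prefix are hypothetical, and the real pipeline computes many more statistics than the mean and variance shown here.

```python
import pandas as pd

def person_period_features(messages: pd.DataFrame, window_days: int = 30) -> pd.DataFrame:
    """Aggregate message-level features into one row per person per window.

    Assumes hypothetical columns: person_id, sent_at (datetime), and
    linguistic features whose names start with "feat_".
    """
    df = messages.copy()
    # Window index = whole windows elapsed since the person's first message.
    first_seen = df.groupby("person_id")["sent_at"].transform("min")
    df["period"] = (df["sent_at"] - first_seen).dt.days // window_days

    feature_cols = [c for c in df.columns if c.startswith("feat_")]
    agg = df.groupby(["person_id", "period"])[feature_cols].agg(["mean", "var"])
    # Flatten the (feature, statistic) column MultiIndex into single names.
    agg.columns = [f"{feat}_{stat}" for feat, stat in agg.columns]
    return agg.reset_index()

# Toy data, invented purely to show the shape of the output.
messages = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2],
    "sent_at": pd.to_datetime(
        ["2020-01-01", "2020-01-15", "2020-02-05", "2020-01-03", "2020-01-04"]
    ),
    "feat_a": [0.1, 0.3, 0.5, 0.2, 0.4],
})
panel = person_period_features(messages)
print(panel)
```

Each additional statistic or message subset (incoming vs. outgoing, local-network averages) multiplies the column count, which is how roughly 25 base features grow to 700+ per person-period. The resulting person-period table is what gets passed to the survival model.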
So far I’ve attempted an elastic net survival model and an XGBoost survival model, but I’m curious what a Bayesian approach to this task might be. Where would you start with such a project, and what kinds of models would you attempt?