Dynamic prediction over time

I need to build a statistical model to predict the assembly time of a machine. The predictions should be updated daily during assembly. Some variables are fixed for each machine, such as the machine model, and some are time-dependent, such as the number of missing parts required to build the machine, the number of days spent on the machine so far, and so on.

So in the dataset, for each machine made in the last few years (about 100 of them), I have a number of rows equal to the number of days spent building that machine. The response variable is the same in every row (it is known only at the end of assembly), while some of the predictor variables change.

How can I model this kind of data? I think it is not a repeated-measures analysis, because for each machine the response is always the same and only some predictors change. I just make predictions several times over time because I expect my forecast to improve as the end of assembly approaches. Of course, I can change the response variable to be the number of days remaining until assembly finishes instead of the total number of days of assembly.
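To make the change of response concrete, here is a minimal base R sketch. The toy data frame and its values are hypothetical; only the column names `order_nr`, `day` and `days_assembly` follow the fake dataset posted later in the thread.

```r
# Hypothetical toy data: one row per machine per assembly day.
d <- data.frame(
  order_nr      = c(1, 1, 1, 2, 2),
  day           = c(1, 2, 3, 1, 2),
  days_assembly = c(3, 3, 3, 2, 2)  # total duration, known only at the end
)

# Days remaining shrink as assembly progresses, so the response now
# differs from row to row within the same machine.
d$days_remaining <- d$days_assembly - d$day
print(d)
```

With this response, the rows of one machine are genuinely repeated measurements of a changing quantity, which is what longitudinal models expect.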

Is there a Bayesian solution to model this type of problem telling the model via the machine id which observations refer to the same machine on different days?

Interesting problem. Take a look at survival analysis. Seems like there could be strong path dependencies though – are there cases where progress stops or the addition of other parts is less effective because a particular part is not available / built? If there are then something wildly nonlinear like a tree is probably needed. If not then maybe something like nonlinearish regression with predictor*time interactions would do it.
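As a rough illustration of the regression with predictor*time interactions mentioned above, here is a sketch on simulated data. Everything in it is an assumption made up for the example (the sample size, the effect sizes, and treating rows as independent); it only shows how the interaction is specified in an R formula.

```r
set.seed(7)

# Hypothetical simulated daily rows (treated as independent purely for illustration).
n <- 200
d <- data.frame(
  day         = sample(1:30, n, replace = TRUE),
  missing_PNs = rpois(n, 10)
)

# Assumed truth: missing parts matter more early in the build than late.
d$days_remaining <- 30 - d$day + (0.5 - 0.01 * d$day) * d$missing_PNs + rnorm(n)

# `day * missing_PNs` expands to both main effects plus their interaction.
fit <- lm(days_remaining ~ day * missing_PNs, data = d)
print(coef(fit))
```

The `day:missing_PNs` coefficient is what lets the effect of missing parts change over the course of the build.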

In addition to @Charles_Driver’s response, I’ll add that you should try to write code to generate fake data. That will help us understand the structure a bit more clearly and will help you start thinking of how to reflect your knowledge of said structure in a generative model.

Sounds like you have a question you’re trying to answer (“given the state of assembly of a device, how much longer will it take to finish?”), and you have data from past assemblies and so could do some exploratory analysis and build a model and make some predictions.

If you’re stuck on how to analyze this, try to generate fake data from a factory process. Think through how you can write down a random process, with parameters and such, that describes a manufacturing process similar to what you think you’re working with. It might give you some ideas on how to look at the data you have, or ideas for other measurements you think need to be made, or ideas on which parameters are actually known or unknown.
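For instance, a process for a single machine could be sketched in base R like this. The function `simulate_machine` and all of its parameters (required hours, starting number of missing parts, arrival and work rates) are hypothetical choices for illustration, not a description of any real factory.

```r
set.seed(42)

# Hypothetical generative process for one machine: each day some missing
# parts arrive and some hours are worked; the machine is finished once the
# required hours are reached AND no parts are missing.
simulate_machine <- function(required_hrs = 100, missing_parts = 40, max_days = 365) {
  rows <- list()
  cum_hrs <- 0
  for (day in seq_len(max_days)) {
    missing_parts <- max(0, missing_parts - rpois(1, 2))  # parts arriving today
    cum_hrs <- cum_hrs + max(0, round(rnorm(1, 5, 2)))    # hours worked today
    rows[[day]] <- data.frame(day = day, cum_hrs = cum_hrs, missing_PNs = missing_parts)
    if (cum_hrs >= required_hrs && missing_parts == 0) break
  }
  out <- do.call(rbind, rows)
  out$days_assembly <- nrow(out)  # total duration, repeated on every row
  out
}

one_machine <- simulate_machine()
print(head(one_machine))
```

Because the finish condition depends on both hours and parts, the total duration is driven by whichever is the bottleneck, which is one simple way to get path dependence into the fake data.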

There might be elements you can adopt from hazard models. Like, in your situation the machine inevitably gets built (so the end is the same). In hazard models the inevitable thing is usually something breaks or dies (and it’s just a question of when), so that seems similar.

Thank you all for the answers.

I created a fake dataset similar to what I have:

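library(tidyverse)

set.seed(123)  # make the fake data reproducible

# One row per machine: static covariates and total assembly duration
fake_data <- data.frame(
  machine_type = sample(x = c("Machine A", "Machine B", "Machine C"), 50, replace = TRUE),
  lot_size = sample(x = 1:4, 50, replace = TRUE)) %>% 
  rowwise() %>% 
  mutate(
    days_assembly = round(case_when(machine_type == "Machine A" ~ 20 + rnorm(1, lot_size * 2, 10),
                        machine_type == "Machine B" ~ 20 + rnorm(1, lot_size * 2, 10),
                        machine_type == "Machine C" ~ 40 + rnorm(1, lot_size * 2, 10))),
    days_assembly = max(days_assembly, 1)  # guard: at least one day of assembly
  ) %>% 
  ungroup() %>% 
  rowid_to_column(var = "order_nr") %>% 
  # Expand to one row per machine per day
  mutate(day = map(days_assembly, seq_len)) %>% unnest(day)

# Daily, time-varying inputs: hours worked and change in missing part numbers
fake_data_2 <- fake_data %>% 
  rowwise() %>% 
  mutate(
    hrs_day = round(rnorm(1, 5, 2)),
    PNs = sample(c(-1, -2, 0), 1)
  ) %>% 
  ungroup()

# Cumulative versions within each machine
fake_data_3 <- fake_data_2 %>% 
  group_by(order_nr) %>% 
  mutate(
    cum_hrs = cumsum(hrs_day),
    missing_PNs = 40 + cumsum(PNs)
  ) %>% 
  ungroup()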

The variables that do not change over time for each machine are machine model and lot size (in the real dataset I have more). The variables that do change are cum_hrs, i.e. the cumulative hours spent on that machine up to that day, and missing_PNs, the missing parts still needed for assembly. In this fake dataset they are not related to the outcome (days_assembly). In the real data I expect that the more missing parts I have, the more days I will spend building the machine, because I have to procure those parts.

A problem that I face is the limited size of the dataset, about 90 machines, because this is the volume of production of the company in the last 2 years for which I have the data.
Again, for each machine I have a number of rows equal to the number of days spent to build that machine.
I tried Cox regression with time-varying covariates (adapting the structure of the dataset) with poor results, maybe because the sample size is too small for the number of predictors I have (about 10) and the proportional hazards assumption is not fulfilled.
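For readers wondering what adapting the structure can look like: one common layout for time-varying covariates is the counting-process format, with one `(start, stop]` interval per daily row and an event flag on the last interval only. The sketch below uses hypothetical toy values and is not necessarily the exact restructuring used here.

```r
# Hypothetical toy data: one row per machine per assembly day.
d <- data.frame(
  order_nr      = c(1, 1, 1, 2, 2),
  day           = c(1, 2, 3, 1, 2),
  days_assembly = c(3, 3, 3, 2, 2),
  missing_PNs   = c(5, 3, 0, 4, 0)
)

# Each daily row becomes the interval (start, stop]; the event
# (assembly finished) occurs only in the machine's last interval.
cp <- transform(d,
  start = day - 1,
  stop  = day,
  event = as.integer(day == days_assembly)
)
print(cp[, c("order_nr", "start", "stop", "event", "missing_PNs")])

# With the survival package installed, a model could then be fit along the lines of:
# survival::coxph(survival::Surv(start, stop, event) ~ missing_PNs, data = cp)
```

Every machine contributes exactly one event, so the number of events equals the number of machines, which is the figure that limits how many predictors the Cox model can support.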
So right now the best model I have is a linear regression, but I am training it on all the observations without telling the model which observations belong to the same machine. That violates the IID assumption, so I think there are much better solutions.
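One standard way to tell a regression which rows belong to the same machine is a random intercept per machine. The sketch below uses `nlme::lme` on simulated data; the data-generating values are assumptions for illustration only, and the response is days remaining rather than total days so that it varies within a machine.

```r
library(nlme)  # ships with standard R installations

set.seed(1)

# Hypothetical simulated data: 30 machines, 10 daily rows each.
n_machines <- 30
n_days <- 10
d <- data.frame(
  order_nr = factor(rep(seq_len(n_machines), each = n_days)),
  day      = rep(seq_len(n_days), times = n_machines)
)
machine_effect <- rnorm(n_machines, 0, 3)  # assumed machine-level deviation
d$missing_PNs <- pmax(0, 40 - 2 * d$day + rpois(nrow(d), 3))
d$days_remaining <- 20 - d$day + 0.2 * d$missing_PNs +
  machine_effect[as.integer(d$order_nr)] + rnorm(nrow(d), 0, 1)

# Random intercept per machine: rows from the same order_nr share a deviation.
fit <- lme(days_remaining ~ day + missing_PNs,
           random = ~ 1 | order_nr, data = d)
print(summary(fit)$tTable)
```

A Bayesian version of the same structure can be written with the lme4-style formula `days_remaining ~ day + missing_PNs + (1 | order_nr)`, for example in `rstanarm::stan_lmer`, though that is not run here.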

Is there a way to do this in base R? Like the process of one machine?

Maybe you can’t model the whole production process in a blackbox sorta way, but it sounds like these machines are quite large (if there are only 90 of them and they can afford to hire a statistician…), and so maybe you can break this down in some other way.

If it really is a dataframe with 90 rows and days-to-build, you might not be able to do much more than make a few exploratory plots and maybe draw a line or two.

This makes me think there’s more than one row for every machine. Similar to writing down the build process of the machine, you could think about how the non-end-time measurements correspond to different parts of the build process.

Sounds like what you need here is a joint longitudinal and time-to-event model (basically a joint model of some kind of survival model, be it Cox or something else, and a longitudinal model, typically a linear mixed model). These are available in Stan in the rstanarm package, but I’ve never used this so I don’t know if it can do dynamic prediction.

However, there is an R package, JMbayes, that definitely does have dynamic predictions: https://github.com/drizopoulos/JMbayes
Dynamic prediction tutorial: http://www.drizopoulos.com/vignettes/dynamic_predictions


Thank you very much jroon, I am trying a joint longitudinal and time-to-event model; it seems to be the right solution for this kind of problem.
For now the predictions do not seem to improve over time. Taking the median survival time as the point forecast, I still get worse results than with a simple linear model that assumes IID, but I hope to improve because this model is the way to go.
Do you think that with only 50 machines in the training set (for a total of 1300 daily observations) I can add only a few variables to the longitudinal part of the model? It seems that if there are too many (5 instead of 3) the model gets worse. Do you have any general advice from your experience?

Thanks again


Are you sure that the longitudinal metric truly has an effect? If it does not, I would not expect repeated measurements to improve things.

Hmm. That’s very different from my experience. I usually have a bigger number of individuals and fewer measurements per individual; for example, I might have 400 individuals and 1200 measurements. When I have tried to fit smaller datasets, I’ve found it usually doesn’t fit well below about 80 individuals. Think of it this way: you need enough longitudinal measurements for a stable fit of the longitudinal model, and you are set there, but you also need enough individuals for a stable Cox model. This might be the problem, as 50 is a bit on the low side here. Does your Cox model have many explanatory variables? With n=50 that most likely won’t work well.