How to model load-collective data for only a few test specimens

I would like to build a model for the following structure of data, but I cannot find the right approach (or even keywords) to look for further information. Maybe you can help?!

Structure of the data
I have about 100 test specimens with different properties (size, hardness, weight, …), i.e. a set of values describing a specific specimen, and one result variable I would like to model, e.g. by a GLM.

In addition to this type of data, I have a load collective with about 10,000 measurements for each specimen described by the set of invariable (per-specimen) properties above. For example, the measurement of a force acting on each specimen. The data is not a time series, but it could be measured in an order.

Additionally, I don’t have measurements for each test specimen, but I know how often each has been exposed to forces from the load collective. The distribution of this load collective is very well known, but it is multimodal and not well represented by a standard distribution function.

The problem I have now:

  • I would like to start simple and use only the load-distribution in a first step and later see how much better the model gets with single measurements.
  • I don’t know how to feed the load distribution into the model. Binning the loads (0–10, 10–20, …) and using the percentage of occurrence per bin doesn’t seem right. The resolution of the binning might also affect the results.

What I am looking for:

  • An approach / starting point for modelling this data structure?
  • Ideas for keywords / search terms to find more information?
  • Is “time series” the right search term, even when the data might not be autocorrelated?
  • Do you know any vignette along which I can find my way?
  • Am I worrying too much? Should I just use the data with one line per observation, the same properties repeated over and over, and only the force column differing?

Thanks for any suggestions and ideas!

Hi,
sorry it took so long to get to your post. Unfortunately, I am left with only a very vague idea about your problem, probably because I lack the domain-specific knowledge. In particular, what is a “load collective”? I’ve never heard this term before, and quick Googling didn’t help much. We also need to know your final objective: what do you want to use the model for? What questions are you trying to answer?

I don’t think I can follow this description. My best guess is that your specimen data look like:

specimen   outcome    size    weight
A          15         10      1
B          7          11      0.6

and the load collective has something like

load  size   weight
725   5      0.5
711   5      0.6
706   5      0.7
...
941   6      0.5
903   6      0.6
...
1526  15     1.9
1542  15     2.0

But then I am at a loss as to what you would want to do with the data…

What do we know about it? Mathematical form? Empirical distribution?

I don’t think so.


Sorry for not expressing the modelling question clearly. First I will try to fix the vocabulary: “load collective” better translates to “load spectrum”.

There is empirical knowledge about how often loads of different sizes occur. Basically, it is a recording over a long time (the 10,000 measurements) from which I can draw.
Since the forces are not generated by one single process but are determined by several factors, this spectrum (a histogram of loads) is at least bimodal, but might have more modes.

The idea is that I draw from this spectrum the loads that the part to be tested has seen before it breaks. I am not sure whether the fact that I draw from this histogram is relevant here, or whether the loads could also have been observed directly for each line.

The specimen itself is not much different from an inanimate carbon rod. It has material properties and dimensions, and the load (a force in N) is applied to it, one force value at a time, until it breaks on the last application of force. Each force value by itself is not high enough to cause failure directly, but the sum will sooner or later cause the rod to fail. The response variable is the number of times the force is applied.

tibble::tribble(
  ~LINE_NUM, ~SPECIMEN_NUM, ~DIAMETER, ~HARDNESS, ~LOAD_APPLIED, ~FAILED,
          1,             1,       173,       202,           561,       0,
          2,             1,       173,       202,           345,       0,
          3,             1,       173,       202,           117,       0,
          4,             1,       173,       202,           689,       0,
          5,             1,       173,       202,           429,       1,
          6,             2,       181,       198,           592,       0,
          7,             2,       181,       198,           792,       0,
          8,             2,       181,       198,           843,       1,
          9,             3,       175,       205,           443,       0,
         10,             3,       175,       205,           582,       0,
         11,             3,       175,       205,           521,       0,
         12,             3,       175,       205,           328,       1
  )

The target value to be modelled is the number of load cycles a specimen withstands before it breaks. For the above example the outcome would be:
Specimen 1: 5 cycles (load applications) to failure
Specimen 2: 3 cycles (load applications) to failure
Specimen 3: 4 cycles (load applications) to failure

I am looking for the posterior distribution of cycles to failure given the load spectrum and the specimen properties.

I am not sure whether my difficulty in finding an approach is also related to the fact that I don’t have (or don’t want to make) an assumption about how the experienced loads add up. This is something I would like to find out with the model. Is it just the plain integral over all loads, or do high loads, or loads from a specific region, have more impact on earlier failure?

Thanks for your help.

Yes, that clarifies a lot, thanks. Just to set up expectations: I believe what you are trying to model looks quite challenging and setting up a good model would likely be a lot of work.

That leaves me with two interpretations of your data:

  1. When the experiment is run, the values of LOAD_APPLIED are chosen from the “load spectrum”, the load is then applied and you measure whether the specimen failed
  2. You don’t actually know which loads were applied to the specimen, only the number of times an (unknown) force was applied before the specimen broke, so you impute the unknown loads by drawing from the “load spectrum”.

Which is it? Or is it something else?

I think you would need to assume at least something about the shape of the “damage adding” function. You could try some very flexible shape families (splines or Gaussian processes), but those tend to be hard to fit and computationally expensive (mostly GPs) or hard to put good priors on (mostly splines), sometimes both.
It would be a lot easier if you had some theory to constrain the rough shape of this relationship (e.g. damage is proportional to the square of the load applied). At the very least, I hope we can assume that the function is increasing (a larger load always produces larger damage).

In any case, I think you would need to introduce some model of latent (unobserved) damage to the rod. Let’s start, for simplicity, with a model where all the rods are identical. Let’s write l_{s,c} for the load applied to specimen s at cycle c, d_{s,c} for the unobserved damage to specimen s after cycle c, and y_{s,c} for a binary variable indicating whether specimen s broke after cycle c.

Assuming a family of functions f with unknown parameters \theta that map load to damage, the model could look something like this:

d_{s,0} = 0
d_{s,c} = d_{s, c-1} + f(l_{s,c}, \theta)
logit(p_{s,c}) = d_{s,c} + \beta
y_{s,c} \sim Bernoulli(p_{s,c})

In this formulation d is on the scale of “log odds”, while p_{s,c} is directly the probability of breaking.
The properties of the rod could then be further inputs to f, or additional linear terms in the computation of p_{s,c}, or both (the final part of the model is just ordinary logistic regression).

There are a lot of other parts that could be changed. For instance, we could instead say that d_{s,c} is relative damage, constrain it between 0 and 1, and have d_{s,c} = d_{s, c-1} + (1 - d_{s,c-1}) f(l_{s,c}, \theta), where f would now compute the relative damage done (so f also outputs a number between 0 and 1). In this formulation, it would probably make sense to have directly y_{s,c} \sim Bernoulli(d_{s,c}). I am not sure which one makes more sense physically.
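To make this concrete, here is a minimal forward simulation of such a latent-damage model in Python. The power-law damage function, the bimodal stand-in for the load spectrum, and all numeric values are placeholders for illustration, not a claim about the actual physics:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_cycles_to_failure(loads, theta, beta):
    """Forward-simulate the latent-damage model: damage accumulates
    as f(l, theta) = (l / 1000) ** theta (an assumed power law) and
    the per-cycle failure probability is logistic in the damage."""
    d = 0.0
    for c, load in enumerate(loads, start=1):
        d += (load / 1000.0) ** theta          # d_{s,c} = d_{s,c-1} + f(l, theta)
        p = 1.0 / (1.0 + np.exp(-(d + beta)))  # logit(p) = d + beta
        if rng.random() < p:                   # y_{s,c} ~ Bernoulli(p)
            return c                           # broke at cycle c
    return len(loads)                          # survived every simulated cycle

# a bimodal stand-in for the load spectrum (mixture of two normals)
spectrum = np.where(rng.random(10_000) < 0.5,
                    rng.normal(400, 60, 10_000),
                    rng.normal(800, 90, 10_000))
cycles = simulate_cycles_to_failure(rng.choice(spectrum, size=5_000),
                                    theta=2.0, beta=-6.0)
print(cycles)
```

Simulating from the model like this before fitting it is also a cheap way to check whether an assumed f produces plausible cycles-to-failure counts.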

In this model, the damage is deterministic. If the damage is stochastic, you might need parameters for the stochasticity as well. But since the yes/no failure response is a very weak signal, I would first assume that all the stochasticity can be absorbed into the Bernoulli step (I believe there is a family of noise terms that always gets absorbed this way, but I am not good enough at math to actually find it).

I think some known time series models could be identical to the case where f is linear, but I would be surprised if a noticeable amount of work has been done for the case where f is non-linear.

Sorry for mentioning it at all: “You don’t actually know which loads were applied”. True for some portion of the specimens. But I have very good reason to believe that they have seen a representative sample from the spectrum.

Thank you very much for the valuable hints and direction. Also, the assessment that there is no well-established standard solution is valuable information that helps me invest my energy in the right places for this model. Gaussian processes are currently a black box for me.

I will proceed as follows:

  • Continue modelling for now with the median load of each spectrum.
  • In the next step, choose a physically reasonable bin width, which should prevent overfitting. Include each load bin as a proportion, with all load bins summing to 1, so that each load bin is treated as a property. The number of load cycles will scale the data separately.
  • Work my way through the Bernoulli approach.
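The binning step could be sketched like this (the loads, the 100 N bin width, and the 0–1000 N range are all made-up illustration values, not part of the real data):

```python
import numpy as np

# made-up loads for one specimen (in N); real data would come from the spectrum
loads = np.array([561, 345, 117, 689, 429])

# physically motivated bin edges (assumed here: 100 N wide, covering 0-1000 N)
edges = np.arange(0, 1001, 100)
counts, _ = np.histogram(loads, bins=edges)
proportions = counts / counts.sum()  # one feature per bin; they sum to 1

n_cycles = len(loads)                # separate variable carrying the scale
print(proportions, n_cycles)
```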

I think I have seen some time series models that could work in this direction, but I saw them before this case came up and have not managed to find them again. Maybe I am also mistaken that they would work well. Thanks again for your help!

Marv

Actually, thinking about it: if you don’t know which loads were applied, that could actually make the model simpler (and worse at prediction). In this case, the individual cycles are not distinguishable from each other, so you might as well treat the number of cycles before breaking as your response/dependent variable, and a negative binomial regression might work just fine, without any assumptions about the damage process. One way to interpret the negative binomial is that you have trials (cycles) that can succeed (no damage) or fail (damage done), and you record the number of trials (cycles) it takes to reach some number r of failures (enough damaging cycles that the specimen breaks). The downside is that if you don’t know the actual loads applied, you probably cannot learn anything about the actual damage process.
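This interpretation can be checked with a small simulation (r and p are arbitrary illustration values): counting cycles until the r-th damaging cycle reproduces the mean of a negative binomial shifted by r, which is r / p.

```python
import numpy as np

rng = np.random.default_rng(1)
r, p = 5, 0.3   # assumed: specimen breaks after r "damaging" cycles,
                # each cycle damages it with probability p

def cycles_until_break(r, p):
    """Count cycles until the r-th damaging cycle occurs."""
    cycles = damaging = 0
    while damaging < r:
        cycles += 1
        if rng.random() < p:
            damaging += 1
    return cycles

sim = np.array([cycles_until_break(r, p) for _ in range(20_000)])

# the total cycle count follows a (shifted) negative binomial
# distribution with mean r / p
print(sim.mean(), r / p)
```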

If you know the loads applied for some specimens, then you could obviously use the “full” damage model and treat the other specimens as having missing data; there is some discussion of this in the User’s Guide: https://mc-stan.org/docs/2_21/stan-users-guide/missing-data-and-partially-known-parameters.html
Instead of binning the empirical distribution, it might be sensible to approximate it with some kernel density and use this density as a custom _lpdf function, or have a continuous variable in the [0,1] interval indicating the quantile with respect to the kernel approximation (caveat: I tried to make this work once and failed, but there were other complicating factors).
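As an illustration of the kernel density idea (outside Stan, with a made-up bimodal spectrum standing in for the empirical one), a Gaussian KDE evaluated in log space could play the role of such a custom _lpdf:

```python
import numpy as np

rng = np.random.default_rng(2)
# made-up stand-in for the empirical load spectrum (bimodal mixture)
spectrum = np.concatenate([rng.normal(400, 60, 5_000),
                           rng.normal(800, 90, 5_000)])

def kde_logpdf(x, sample, bw):
    """Log density of a Gaussian kernel density estimate; the same
    expression could be transcribed into a custom _lpdf function."""
    z = (x - sample[:, None]) / bw
    log_kernels = -0.5 * z**2 - np.log(bw * np.sqrt(2.0 * np.pi))
    return np.logaddexp.reduce(log_kernels, axis=0) - np.log(len(sample))

bw = 1.06 * spectrum.std() * len(spectrum) ** (-1 / 5)  # Silverman's rule
lp = kde_logpdf(np.array([400.0, 600.0, 800.0]), spectrum, bw)
print(lp)  # the two modes should come out denser than the trough at 600
```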

But maybe just using the number of cycles before failure as the response in a negative binomial regression would be a quick start that helps you better understand the data in any case.

Best of luck with your model!

I will check the model for robustness first, for the parts I know. Then check, with the known loads applied and a reasonable binning, how much can be explained by this data. With that result, I hope to make an assumption about the actual damage process and go on from there. Thank you very much!