@bgoodri re: linux node: I think I might have figured out what was wrong. When it was running too many g++ copies (seems to be more than ~12 or so with --std=c++14; it used much less RAM before), it would run out of memory and use up a ton of swap, which would bring the entire system to a halt for a while. I now have it set to 1 executor with 12 cores, and I also turned off swap on the machine temporarily (it will come back with a reboot or a `swapon`); this seems to have helped the issue. I turned the swap off so that runaway processes would be killed by the kernel instead of swamping the system, which seemed preferable. The machine had some absurd amount of swap, I think 32GB, presumably not on an SSD(?).
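For reference, the fix described above can be sketched roughly like this; the exact job count and commands are my assumptions, not the node's actual configuration:

```shell
# Hedged sketch: cap parallel compile jobs so at most ~12 g++ processes
# run concurrently (the -j value here is an assumption).
export MAKEFLAGS="-j12"
echo "make will use: $MAKEFLAGS"

# Temporarily disable swap so the kernel's OOM killer terminates runaway
# builds instead of letting the box thrash; undone by a reboot or swapon.
# (Requires root, so shown as comments only:)
#   sudo swapoff -a
#   sudo swapon -a   # re-enable later
```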
What's the current status of Jenkins? Looks like a lot of jobs queued. (I can't tell if Math is being tested twice per pull request.)
More importantly:
- can we trust the status of the Stan and Math libraries on the `develop` branch?
- are we testing the same things as before the pipelines?
- what do we need to think about / do so that it's back to running in the background without a lot of effort? (And hopefully without a large backlog.)
At some point, I had considered running the distribution tests on the Columbia cluster. I nearly had the execution figured out, but didn't know how to get the results back.
Current status: lots of jobs queued. Nothing is being tested twice, though the Stan repo is set up to test PRs against the latest develop, as we're discussing in the other thread.
As far as I know, things should be better than ever on develop. There was an issue where Math's develop was breaking Stan's develop, described here (due to the old tests not doing a full upstream test suite).
We are testing more in at least two ways (and everything that was tested before is still being tested), the first of which I am open to turning off:
- we test PRs as merged against the latest develop, and run new builds when develop is updated
- we run the exact full upstream test job instead of a simulacrum of one
This is exactly what the pipeline development and the other thread are about. I'm also looking into offloading tests to EC2, but ideally we won't need that option if we make better tradeoffs about how we test.
I think you are mostly experiencing this as more of a foreground issue because I'm noticing the increased load (mostly due to more people submitting pull requests, which is awesome!) and trying to change things somewhat drastically to adapt, with your permission.
Thanks for the update, @seantalts!
Completely agreed that it's great that more people are submitting pull requests! But that's not why I see this as a foreground issue. It's more the backlog:
The backlog used to get deep, but not to this level. I couldn't tell if it was the new pipelines showing up as separate jobs. Plus… I'm now seeing all the failure emails. I should jump off that list again.
For the first part of 1, that seems right. As for the second part of 1, we used not to test that unless we thought there would be a problem.
Great that you've got 2 worked out.
Cool. If you've got things under control, then I'll just let it be.
From the outside, it doesn't look so clean (but that's just a perception thing). What I really don't want is big policy changes being shoehorned in unilaterally for the sake of convenience. I don't mind if we make policy changes, but if it's something as impactful as changing the stability of our code base, then we really shouldn't push that through without walking through the impact (both pros and cons), unless it's clearly a win everywhere.
Yesterday I fixed a couple of merge conflicts in pending PRs - that might have added to the queue.
I think part of it is that more things have been parallelized, a little more in the old jobs and way more in the new ones. So the "build queue (20)" is not especially helpful in figuring out what the actual load is; it's easier to look at each top-level job and see what's queued up at that level.
I fixed the emails; not sure if they were turned off on purpose or what. The last one to that list had been in 2014, but now they're going there again (which I think is something you directly asked me to do - happy to change it to whatever scheme you think is appropriate).
This is mostly right - when I started we'd test the merge result whenever someone pushed to the PR branch for everything but the upstream tests, but those are now doing the merge result too, even on the old jobs' upstream tests.
Does this count as a policy change that the community decided together, which we should now put to some kind of vote to see how we want to change it? Or is this a tradeoff that you made based on what Jenkins did automatically? This is a relevant question because I'm not sure how to get the old behavior back; the options in the new system are either to keep testing the PR as merged against the latest develop, or to test just the branch. There might be a way to do some custom scripting to get the old behavior back.
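To make the two options concrete, here is a self-contained sketch in a throwaway repository (branch names are made up) of the difference between testing a PR branch head as pushed, the old behavior, and testing it as merged into the latest develop:

```shell
# Throwaway demo repo; branch names are hypothetical.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git config user.email ci@example.com
git config user.name ci
git checkout -qb develop
echo base > file.txt && git add file.txt && git commit -qm "base"
# A PR branch, diverging from develop:
git checkout -qb feature
echo change >> file.txt && git commit -qam "feature work"
# develop moves on after the PR was opened:
git checkout -q develop
echo other > other.txt && git add other.txt && git commit -qm "more develop work"
# Old behavior: test the branch head exactly as pushed
echo "branch head: $(git rev-parse --short feature)"
# New behavior: test the PR as merged into the latest develop
git merge -q --no-ff --no-edit feature
echo "merge result: $(git rev-parse --short HEAD)"
```

The merge-result build catches breakage that only appears when the PR meets newer develop commits, which the branch-head build would miss.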
What looks dirty to you? The build queue? I'm trying to get that under control with these other changes, some of which you are saying are policy changes I am shoehorning in.
You're painting this as a very black-and-white "either we want `develop` to be stable or not" because you've decided in a fairly binary fashion what you think can break `develop`, based on your experience. In reality it's always been a bunch of frankly uninteresting tradeoffs that I doubt literally anyone else is aware have been happening under the hood. To reiterate: `develop` has never been fully protected. Did you know the upstream tests weren't tested against the PR merge result? I doubt anyone else did.
PS Thank you Mitzi!
I think this is fine how you have it.
Here, dirty wasn't what I was trying to get across. It's just that there has been a lot of chatter recently about Jenkins. I'm not saying anything is broken or should have been done differently! (What I think would have been clean is if we had the pipelines working before cutting over pieces at a time, and had the new box tested on dummy projects before bringing it into our testing framework.)
Sorry if it seems as if I'm painting that picture. Sometimes it's really hard to communicate intent in writing.
The point about `develop` being stable is that up until now, we've explicitly stated it in our docs (the Stan manual), we've assumed it was true, and we've tried, to the best of our ability, to keep `develop` on Math and Stan in a good state. It wasn't always that way, but we switched to that model a while ago. It was something the devs at the time sat down and discussed. We weighed the pros and cons (which at the time were very different from now), and that's what we did.
Deviating from that knowingly seems like a very big change that we should give the same amount of respect. Everything that makes Math and Stan safer isn't really a policy change. Going from as-safe-as-we-know-how to something less than that is something we should discuss. I'm not opposed to it, but I think we should slow our roll before making this decision. I know we wouldn't have made as much progress to this point if it weren't stable, so before we change that, we should think hard about it. (I'm OK with the change once we really evaluate it.)
I don't think that was true:
- the PR in Math kicked off tests that included upstream tests for Stan (testing develop with the PR merged in), then for CmdStan (against Stan with the Math develop plus the PR merged in)
- after the PR was merged, this used to create a PR from Math to Stan with the updated submodule. This is where the branch would again be tested against Stan and CmdStan.
Btw, thanks for that clarification! That's what I was trying to figure out. I used to know how to determine the load, and whatever I used to do isn't appropriate now.
Thanks! I think it's either all or none; all to a list is fine. People can subscribe if they want to know about every Jenkins job.
You're right about the PR thing - I forgot about that. I changed that after I changed the PR jobs to test the merged code upstream.
Btw, the queue is down to 1 job now!
FWIW, Jenkins died in the middle of a PR this morning - request to retest ignored.
What PR?
I need to write a post on the new pipeline builds (I was hoping to get the Math one done more quickly than it has been), but the new way to retrigger a job is to click through to the Jenkins job and click "Build" (or "Build with Parameters" - no need to enter any parameters for the default). I just did this and it re-ran successfully.
I'm also looking into why the connection keeps dropping between the master and the slave. Hopefully the queue will die down and I can restart the boxes with an upgraded JDK and Jenkins plugins.