Jenkins / Travis / CI Issues

@bgoodri re: the Linux node: I think I might have figured out what was wrong. When it was running too many g++ copies (seems to be more than ~12 or so with --std=c++14; it used much less RAM before), it would run out of memory and eat into swap, which would bring the entire system to a halt for a while. I now have it set to 1 executor with 12 cores, and I also turned off swap on the machine temporarily (it will come back with a reboot or a swapon); this seems to have helped. I turned swap off so that the processes would be killed by the kernel instead of swamping the system, which seemed preferable. It had some absurd amount of swap, I think 32GB, and I’m guessing it’s not on an SSD(?).
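For reference, the sizing logic is roughly the following (a minimal sketch in Python; the ~2 GB peak per g++ process is an assumption for illustration, not something measured on that box):

```python
import os

# Minimal sketch of the sizing logic above. The ~2 GB peak RSS per g++
# process compiling with -std=c++14 is an assumption for illustration,
# not a measurement from the Jenkins box.
ASSUMED_GB_PER_GPP = 2.0
CORES = os.cpu_count() or 1

def safe_parallel_jobs(total_ram_gb: float) -> int:
    """Cap parallel compile jobs by RAM as well as by core count."""
    ram_limited = int(total_ram_gb // ASSUMED_GB_PER_GPP)
    return max(1, min(CORES, ram_limited))

if __name__ == "__main__":
    # e.g. a 16 GB box would get min(cores, 8) parallel jobs
    print(safe_parallel_jobs(16.0))
```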


What’s the current status of Jenkins? Looks like a lot of jobs queued. (I can’t tell if math is being tested twice per pull request.)

More importantly:

  • can we trust the status of the Stan and Math libraries on the develop branch?
  • are we testing the same things as before the pipelines?
  • what do we need to think about / do so that it’s back to running in the background without a lot of effort? (And hopefully without a large backlog.)

At some point, I had considered running the distribution tests on the Columbia cluster. I nearly figured out the execution, but didn’t know how to get the results back.

Current status: lots of jobs queued. Nothing is being tested twice, though the Stan repo is set up to test PRs against the latest develop as we are talking about in the other thread.

As far as I know, things should be better than ever on develop. There was an issue where Math’s develop was breaking Stan’s develop, described here (due to the old tests not running the full upstream test suite).

We are testing more in at least two ways (and nothing that was tested before has been dropped), the first of which I am open to turning off:

  1. we test PRs as merged against the latest develop, and run new builds when develop is updated
  2. we run the exact full upstream test job instead of a simulacrum of one

This is exactly what the pipeline development and the other thread are about. I’m also looking into offloading tests to EC2, but ideally we won’t need that option if we make better tradeoffs about how we test.

I think you’re mostly experiencing this as a foreground issue because I’ve noticed the increased load (mostly due to more people submitting pull requests, which is awesome!) and am trying, with your permission, to change things somewhat drastically to adapt.

Thanks for the update, @seantalts!

Completely agreed that it’s great that more people are submitting pull requests! But that’s not why I see this as a foreground issue. It’s more the backlog:

The backlog used to get deep, but not to this level. I couldn’t tell if it was the new pipelines showing up as separate jobs. Plus… I’m now seeing all the failure emails. I should jump off that list again.

For the first part of 1, that seems right. As for the second part, we previously only tested that when we thought there would be a problem.

Great that you’ve got 2 worked out.

Cool. If you’ve got things under control, then I’ll just let it be.

From the outside, it doesn’t look so clean (but that’s just a perception thing). What I really don’t want are big policy changes being shoehorned in unilaterally for the sake of convenience. I don’t mind if we make the policy changes, but if it’s something as impactful as changing the stability of our code base, then we really shouldn’t push it through without walking through the impact (both pros and cons), unless it’s clear that it’s a win everywhere.

Yesterday I fixed a couple of merge conflicts in pending PRs - that might have added to the queue.


I think part of it is that more things have been parallelized, a little more in the old jobs and way more in the new ones. So the “build queue (20)” number is not especially helpful in figuring out what the actual load is; it’s easier to look at each top-level job and see what’s queued up at that level.

I fixed the emails; not sure if they were turned off on purpose or what. The last one to that list had been in 2014, but now they’re going there again (which I think is something you directly asked me to do - happy to change it to whatever scheme you think is appropriate).

This is mostly right - when I started, we’d test the merge result whenever someone pushed to the PR branch for everything but the upstream tests; now even the old jobs’ upstream tests use the merge result too.

Does this count as a policy change the community decided on together, which we should now put to some kind of vote to see how we want to change it? Or is it a tradeoff you made based on what Jenkins did automatically? This is a relevant question because I’m not sure how to get the old behavior back; the options in the new system are either to test the PR as merged against the latest develop or to test just the branch. There might be a way to do some custom scripting to get the old behavior back.
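For concreteness, the two options boil down to something like this (a sketch in Python driving plain git; branch names and the remote are placeholders, not the actual Jenkins configuration):

```python
import subprocess

# Sketch of the two options, driven from Python with plain git commands.
# Branch names ("develop", pr_branch) and the remote name are placeholders,
# not the actual Jenkins job configuration.
def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

def test_branch_only(pr_branch: str) -> None:
    """Option 1: build exactly what the contributor pushed."""
    run("git", "fetch", "origin")
    run("git", "checkout", f"origin/{pr_branch}")
    # ... run the test suite against this checkout ...

def test_merge_result(pr_branch: str, target: str = "develop") -> None:
    """Option 2: build the PR as it would look merged into current develop."""
    run("git", "fetch", "origin")
    run("git", "checkout", "--detach", f"origin/{target}")
    run("git", "merge", "--no-edit", f"origin/{pr_branch}")
    # ... run the test suite; this catches breakage from develop moving on ...
```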

What looks dirty to you? The build queue? I’m trying to get that under control with these other changes, some of which you’re saying are policy changes I’m shoehorning in.

You’re painting this as a very black-and-white “either we want develop to be stable or not” choice, because you’ve decided in a fairly binary fashion, based on your experience, what you think can break develop. In reality it’s always been a bunch of frankly uninteresting tradeoffs that I doubt literally anyone else is aware have been happening under the hood. To reiterate: develop has never been fully protected. Did you know the upstream tests weren’t run against the PR merge result? I doubt anyone else did.

PS Thank you Mitzi!


I think this is fine how you have it.

Here, dirty wasn’t what I was trying to get across. It’s just that there’s been a lot of chatter about Jenkins recently. I’m not saying anything is broken or should have been done differently! (What I think would have been clean is if we’d had the pipelines working before cutting pieces over one at a time, and had tested the new box on dummy projects before bringing it into our testing framework.)

Sorry if it seems as if I’m painting that picture. Sometimes it’s really hard to communicate intent over written communication.

The point about develop being stable is that up until now, we’ve explicitly stated it in our doc (the Stan manual), we’ve assumed it was true, and we’ve tried, to the best of our ability, to keep develop on Math and Stan in a good state. It wasn’t always that way, but we switched to that model a while ago. It was something the devs at the time sat down and discussed. We weighed the pros and cons (which at the time were very different from now), and that’s what we did.

Deviating from that knowingly seems like a very big change that we should give the same amount of respect. Anything that makes Math and Stan safer isn’t really a policy change. But going from as-safe-as-we-know-how to something less than that is something we should discuss. I’m not opposed to it, but I think we should slow our roll before making this decision. I know we wouldn’t have made as much progress to this point if develop weren’t stable, so before we change that, we should think hard about it. (I’m OK with the change once we really evaluate it.)

? I don’t think that was true:

  • a PR in Math kicked off tests that included upstream tests for Stan (testing Stan’s develop with the PR merged into Math), then for CmdStan (with Stan pointing at that same Math develop with the PR merged in)
  • after the PR was merged, this used to create a PR from Math to Stan with the updated submodule. That’s where the branch would again be tested against Stan and CmdStan.
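In code terms, that second step amounts to something like the following (a sketch in Python driving git; the submodule path and branch name are just illustrative, and opening the actual PR would go through the GitHub API):

```python
import subprocess

# Rough sketch of the submodule-bump step described above: after a Math PR
# lands, a job updates Stan's math submodule pointer to the new Math develop
# and pushes a branch that then gets tested against Stan and CmdStan.
# The path "lib/stan_math" and the branch name are illustrative only.
def run(*cmd: str, cwd: str | None = None) -> None:
    subprocess.run(cmd, check=True, cwd=cwd)

def bump_math_submodule(stan_checkout: str, bump_branch: str = "update-math-submodule") -> None:
    sub = "lib/stan_math"
    run("git", "checkout", "-b", bump_branch, cwd=stan_checkout)
    run("git", "submodule", "update", "--init", sub, cwd=stan_checkout)
    # Point the submodule at the tip of Math's develop
    run("git", "fetch", "origin", "develop", cwd=f"{stan_checkout}/{sub}")
    run("git", "checkout", "origin/develop", cwd=f"{stan_checkout}/{sub}")
    # Record the new submodule commit in Stan and push the branch
    run("git", "add", sub, cwd=stan_checkout)
    run("git", "commit", "-m", "Update math submodule to latest develop", cwd=stan_checkout)
    run("git", "push", "origin", bump_branch, cwd=stan_checkout)
```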

Btw, thanks for that clarification! That’s what I was trying to figure out. I used to know how to determine the load, and whatever I used to do doesn’t apply now.

Thanks! I think it’s either all or none; all to a list is fine. People can subscribe if they want to know about every Jenkins job.

You’re right about the PR thing - forgot about that. I changed that after I changed the PR jobs to test the merged code upstream.

Btw, the queue is down to 1 job now!

FWIW, Jenkins died in the middle of a PR this morning - my request to retest was ignored.

What PR?

https://github.com/stan-dev/cmdstan/pull/579

I need to write a post on the new pipeline builds (I was hoping to get the Math one done more quickly than it has been), but the new way to retrigger a job is to click through to the Jenkins job and click “Build” (or “Build with Parameters” - no need to enter any parameters for the default). I just did this and it re-ran successfully.

I’m also looking into why the connection keeps dropping between the master and the slave. Hopefully the queue will die down and I can restart the boxes with an upgraded JDK and Jenkins plugins.


@seantalts - many thanks!