Jenkins down?

@seantalts, looks like Jenkins is down. Mind restarting it?

If there’s some procedure for restarting it safely, please post it somewhere; I’d be happy to restart it myself if it goes down like this again.

Looks like it’s up again.

It’s not really up. It’s out of heap space. I hit the “safe restart” button.

@seantalts, mind helping me figure out what’s going on with this?
http://d1m1s1b1.stat.columbia.edu:8080/job/Stan/view/change-requests/job/PR-2601/54/console

It’s been sitting like that for a while.

And when I look at the job it’s waiting on, it looks like it’s waiting for that particular executor to free up before it can run the next job:

The non-Windows interface tests with MPI are set to use only the Linux node, as that’s the OS we officially support, I think.

Here is the file and line where this is set: https://github.com/stan-dev/cmdstan/blob/develop/Jenkinsfile#L93

I don’t remember the full story behind why that one is Linux only, but maybe there’s something in the git history. If you want to submit a PR changing it to run on either Mac or Linux, that might be a good way to get your feet wet with this new job DSL and declarative pipelines.
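
To make that concrete, the change would basically be widening the agent label on that stage. Here’s a rough sketch in declarative-pipeline syntax; the stage name, labels, and steps are just placeholders, not copied from the real cmdstan Jenkinsfile:

```groovy
pipeline {
    agent none
    stages {
        // Placeholder stage standing in for the MPI interface tests.
        stage('Interface tests with MPI') {
            // Was pinned to the Linux box, e.g. agent { label 'linux' }.
            // A label expression like 'linux || osx' lets Jenkins run it on
            // whichever labeled node has a free executor.
            agent { label 'linux || osx' }
            steps {
                sh 'echo "run the MPI interface tests here"'
            }
        }
    }
}
```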

After 8+ hours of waiting, it finally just started the other jobs. I’m not sure what happened; I’m guessing it hit some timeout? The weird thing is that it didn’t fail the build.

How do you deal with deadlocks in Jenkins? I think that’s what this was.

I don’t think there’s a timeout, and I’ve never seen a deadlock with these builds before. The way I engineered them to avoid deadlocks is to not reserve nodes for the main thread of execution; that’s done with the “agent none” at the top. I’d be curious to see something that looks like a deadlock while it’s happening so I can investigate. Do you have any records of this?
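
For reference, the overall shape is something like this. Stage names, labels, and steps are placeholders; the important part is the “agent none” at the pipeline level, with each stage only holding a node while it runs:

```groovy
pipeline {
    // No executor is held for the overall run.
    agent none
    stages {
        stage('Build') {
            // Each stage requests a node only while it actually runs,
            // and releases it as soon as the stage finishes.
            agent { label 'linux' }
            steps {
                sh 'make build'
            }
        }
        stage('Test') {
            agent { label 'linux' }
            steps {
                sh 'make test'
            }
        }
    }
}
```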

What tipped me off is captured in the two screenshots above. Sorry… I should have added more context.

In the bottom screenshot, it says “CmdStan >> downstream_tests” is running on “gelman-group-linux”. The top screenshot is from that job’s log, where it was stuck at “Starting building: CmdStan >> PR-645 #28”. It had already finished all the integration tests and was just waiting.

Going back to the bottom screenshot, I think “part of CmdStan >> PR-645 #28” was waiting on the next available executor on gelman-group-linux, and that job was in turn waiting for this new job in the queue to finish.

That’s why I thought there was some deadlock that eventually got broken by some sort of timeout, but that could all be wrong. I just don’t know why it would wait like that for hours and hours without producing any output.

Once that job finished, the queue got down to 0 really quickly.

I think those builds were all legitimately waiting for an available executor on the Linux node, and there was a legitimate job running on that node. The closest thing to a deadlock I’ve seen so far is that some of your makefile PRs have spawned infinite loops on Jenkins, which obviously causes jobs to pile up and something close to a deadlock to appear.
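
If we ever wanted runaway jobs like that to be killed automatically instead of piling up, declarative pipelines do have a timeout option. This is just a minimal sketch; it’s not something the current Jenkinsfiles do as far as I remember, and the 10-hour figure is made up:

```groovy
pipeline {
    agent none
    options {
        // Abort the whole run if it exceeds this bound; timeout() can also
        // be applied per stage inside a stage-level options block.
        timeout(time: 10, unit: 'HOURS')
    }
    stages {
        stage('Tests') {
            agent { label 'linux' }
            steps {
                sh 'make test'
            }
        }
    }
}
```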

Ok. That makes sense. I’ll let you know if it pops up again.

And yes, those infinite loops weren’t cool. Sorry about that.
