Jenkins / Travis / CI Issues

If anyone is having Jenkins and Travis issues (in the past couple of days or from now on) please leave a message here and I can look into them.

Thanks!

Travis has not worked reliably for rstan and rstanarm for years now. I am planning to just switch them to use the Jenkins in my office.

Yeah, among other things I've seen Travis sometimes take an extra 15 or 20 minutes over "normal" to run tests (thus causing them to time out). I keep breaking them up into smaller and smaller chunks every time I see this...

Just saw the linux node timeout again, which looks like this in the logs:

FATAL: command execution failed
Command close created at
	at hudson.remoting.Command.<init>(Command.java:60)
	at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1132)
	at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1130)
	at hudson.remoting.Channel.close(Channel.java:1290)
	at hudson.remoting.Channel.close(Channel.java:1272)
	at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1137)
Caused: hudson.remoting.Channel$OrderlyShutdown
	at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1138)
	at hudson.remoting.Channel$1.handle(Channel.java:535)
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:83)
Caused: java.io.IOException: Backing channel 'gelman-group-linux' is disconnected.
	at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:192)
	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:257)
	at com.sun.proxy.$Proxy99.isAlive(Unknown Source)
	at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1138)
	at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1130)
	at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:155)
	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:109)
	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:736)
	at hudson.model.Build$BuildExecution.build(Build.java:206)
	at hudson.model.Build$BuildExecution.doRun(Build.java:163)
	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:496)
	at hudson.model.Run.execute(Run.java:1737)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
	at hudson.model.ResourceController.execute(ResourceController.java:97)
	at hudson.model.Executor.run(Executor.java:419)
Build step 'Execute shell' marked build as failure
ERROR: Step 'Scan for compiler warnings' failed: no workspace for Stan - Tests - Integration #513
ERROR: Step 'Publish JUnit test result report' failed: no workspace for Stan - Tests - Integration #513
ERROR: gelman-group-linux is offline; cannot locate JDK8u66
Finished: FAILURE

I don't know. I guess we just have to live with it until someone figures out a fix.

I still can't even find the agent log files... I'm updating the SSH slaves plugin on Jenkins. Mind if I try Oracle's JRE just to try things that shouldn't work but might anyway?

OK (this is a complete sentence).

Baha. Okay, just updated it to oracle jre8 (some warning message told me 9 was not advisable yet) and updated the SSH slave plugin. Let's hope that helps...

@mitzimorris wrote to me:

this PR hung: https://github.com/stan-dev/stan/pull/2391
requests to retest didn't work.

She's right; as posted above, the linux node went down again yesterday. I'm not sure why her request to retest didn't work...

@mitzimorris experienced another Jenkins weirdness: running Stan Pull Request - Upstream - CmdStan on the same machine as Stan Pull Request - Tests - Unit seems to cause the former job to fail with errors like this:

[ RUN      ] CmdStan.optimize_newton
unknown file: Failure
C++ exception with description "bad lexical cast: source type value could not be interpreted as target" thrown in the test body.
[  FAILED  ] CmdStan.variational_meanfield (174 ms)
[ RUN      ] CmdStan.variational_fullrank
unknown file: Failure
C++ exception with description "bad lexical cast: source type value could not be interpreted as target" thrown in the test body.
[  FAILED  ] CmdStan.optimize_newton (24 ms)
[----------] 4 tests from CmdStan (43 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test case ran. (43 ms total)
[  PASSED  ] 3 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] CmdStan.optimize_newton

 1 FAILED TEST
make: *** [test/interface/optimization_output_test] Error 1

(from http://d1m1s1b1.stat.columbia.edu:8080/job/Stan%20Pull%20Request%20-%20Upstream%20-%20CmdStan/597/console)

Anyone have any ideas about why this might be? @syclik or @Bob_Carpenter might know Jenkins, or whether the CmdStan tests use some global files or something. I'm not even sure how the jobs ran at the same time, given that the upstream tests run in a different, non-parallelized phase of Stan Pull Request.

CmdStan shouldn't have a problem.

There are a few places where we could have trouble:

  • Maybe the way we run it from within the tests. If you look at src/test/utility.hpp, we're running things using popen. Maybe that's failing under the new linux boxes? I don't think there's a difficulty with multiple popens in parallel.
  • CmdStan uses pointers (in the argument parsing) and there's a possibility that it's not safe somewhere.
  • I googled that exception. It looks like it's a boost::lexical_cast exception. We can try to trace where that's happening. There are a few places where we use lexical_cast.

To summarize the current status, here are things that I think have been causing flakiness:

  1. My original change to the old jobs to allow testing against pull requests on forks has encountered a couple of corner cases so far that have caused spurious failures.
  2. Adding the linux box back in and trying to figure out how to use it without it imploding / its network connection dropping. I think it's been in a pretty good state for the past week or so, finally.
  3. Github going down (semi-rare, but I've seen it a few times).
  4. Two jobs running simultaneously that conflict in ways I don't totally understand.

There might be more I'm missing - anyone have others?

I think the vast majority have been due to #2, and pipelines are supposed to give us better job isolation (dealing with #4), better robustness in the face of node failure (#2), and tools to add retries etc. to help deal with things like #3. I think the mechanism behind #1 is a little better in pipeline land as well, since it gets its parameters from a plugin with commercial support that seems a little more robust than the old "Github Pull Request Builder" plugin.
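
For concreteness, here's roughly the shape of pipeline I mean, as a minimal scripted-pipeline sketch rather than our actual job config (gelman-group-linux is the real linux node; the 'osx' label and the runTests.py command are placeholders):

// Scripted pipeline sketch: each branch runs on its own node with its own
// workspace (job isolation, #4), and the GitHub checkout is wrapped in a
// retry so transient GitHub/network failures (#3) get another chance.
stage('Unit tests') {
    parallel(
        'linux': {
            node('gelman-group-linux') {
                retry(3) { checkout scm }           // retry flaky checkouts up to 3 times
                sh './runTests.py src/test/unit'    // placeholder test command
            }
        },
        'mac': {
            node('osx') {                           // hypothetical label for the Mac node
                retry(3) { checkout scm }
                sh './runTests.py src/test/unit'
            }
        }
    )
}

Each branch still does its own checkout here, which is the part that talks to GitHub the most.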

Daniel, what kind of stuff could we simplify or coalesce to add robustness or save time, respectively?

I don't think my experimentation with pipelines has been affecting the other jobs, other than that the pipeline builds are also testing pull requests and thus adding load.

Quotes from @syclik in an email thread:

Bob, regarding more robust alternatives. One thing we can do is simplify and just have things take more time.

Simple is good. It depends how much more time we're talking about; it's already very time consuming.

We broke things up into smaller chunks so that we could get finer-grained information out about the failure, but if we're willing to give that up, it makes life easier.

We need to get as much information as the pull requester is going to need to debug.

There are things we canā€™t always control like GitHub going down.

Understood.

If we wanted to simplify, we could have just one project to test Math, one project to test Stan. I think we'd still want to test that project over a number of different configurations, but it'd be one project. That would allow us to easily run multiple pull requests in parallel. Right now, Math is tested across something like 6 different projects.

Pros of multiple projects:

  • Post-processing of each of those projects is done separately. We can check for things like gcc warnings.
  • We can run multiple pieces of testing for a single pull request in parallel. This makes it quicker for us to determine whether a pull request has failed.
  • The project names are sort of descriptive, and just seeing which project failed indicates what needs to be fixed.

Cons of multiple projects:

  • For Math, this means maintaining 6 different projects.
  • Jenkins has to maintain 6 different workspaces, merging 6 times against the same branch. (Space)
  • Post-processing the log isn't feasible when it's one project. Scanning the log for gcc warnings won't actually work using the built-in plugins. (I've tried, but a long time ago.)
  • We can't easily run multiple pull requests in parallel. I think we can, but we'd need even more storage for copies of multiple projects.
  • It's hard to tell what's going on. I believe @seantalts's work with pipelines should clean a lot of this up, but it's still easier to see what's happening when you see that the one project for the repo failed.
  • Triggering other jobs properly is harder than it seems. Hopefully pipelines fix that a bit too. So, having one project is a bit easier.

It also hits GitHub less often, so we might have fewer issues with their downtime.

I think the pipelines will get us the best of both worlds here, except that right now they're set up to be super parallel and I've done it in a lazy way that involves checking out the git repo on each machine. I can look into changing that to use a new stashing and unstashing feature to spread the git repo across parallelized nodes, which might legit give us the best of both. The parallelization progress bar visualization is a little messed up right now, but I suspect that will get better in future versions, and failures and output are still clearly visible on a per-stage basis (see this and this general stage view for some examples).

I'll look into the stashing thing!

Stashing seems like it might be a decent solution! It's breaking something weird right now, but I think I'll eventually be able to get that settled, and then we can talk to GitHub just once at the beginning of the build (and wrap that in a retry block).

Another issue I just encountered is that the Stan src/test/performance tests don't work on the linux node (due to linux or g++, I'm not sure which). I knew this already, but didn't realize until today that Math's Upstream - Stan tests also run the performance tests. This results in error messages that look like this:

src/test/performance/logistic_test.cpp:111
Value of: first_run[0]
  Actual: -66.1493
Expected: -65.7658
lp__: index 0

(from here, but this link will eventually stop working).

What do you mean by "stashing"? (It sounds great no matter what it is.)

Let's split this off into its own thread and fix the problem for good!

Right now it's hardware- and compiler-dependent, which makes it not a good test at all. I've mentioned this before: the purpose of the test has really expanded from just timing to a crude integration test (due to real bugs that were introduced; this was the easiest thing to adapt to prevent future bugs).

Stashing is basically just asking Jenkins to tar up the working directory (or some subset of it) and then unstash it on new nodes on demand; it makes that pretty easy and hides the inter-machine communication aspects. You can see some light doc here.
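
Concretely, the pipeline bit ends up looking something like this (a minimal sketch, not our actual Jenkinsfile; the 'master' and 'osx' labels and the runTests.py calls are placeholders):

// Check out once (wrapped in a retry in case GitHub hiccups), stash the
// workspace, then unstash it on whichever nodes run the parallel stages.
node('master') {
    retry(3) {
        checkout scm                          // the only conversation with GitHub
    }
    stash name: 'sources', includes: '**'     // tar up the working directory
}

parallel(
    'unit-linux': {
        node('gelman-group-linux') {
            unstash 'sources'                 // restore the stashed workspace here
            sh './runTests.py src/test/unit'
        }
    },
    'integration-mac': {
        node('osx') {
            unstash 'sources'
            sh './runTests.py src/test/integration'
        }
    }
)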

It seems to be working now! It takes ~3 minutes to unstash the first time on a new machine (full Stan + Math repos), but it only copies over the network once, so the 2nd time takes only a few seconds. I think this is worth it since it means we talk to GitHub way less, and that whole checkout process for both repos could take a minute or two on its own anyway.

Here's the build for the PR with all the bells and whistles hooked up in the last two builds: http://d1m1s1b1.stat.columbia.edu:8080/job/Stan%20Pipeline/view/change-requests/job/PR-2414/

Regarding email notification, we had set up a Google Group called stan-buildbot so the notification went to a list that people could subscribe to. I don't think it should just go to the one email address. (I'm guessing you haven't even seen these emails?)

---------- Forwarded message ----------
From: ...@gmail.com
Date: Fri, Oct 13, 2017 at 7:07 AM
Subject: [StanJenkins] SUCCESSFUL: Job 'Stan Pipeline/PR-2414 [29]'
To: ...@gmail.com

SUCCESSFUL: Job 'Stan Pipeline/PR-2414 [29]': Check console output at http://d1m1s1b1.stat.columbia.edu:8080/job/Stan%20Pipeline/job/PR-2414/29/

  1. We can add another email address to always send to (see the sketch below this list). What address should I put in? I can't actually figure out from that page how I would send a message to that group, haha.
  2. Sending mail to the buildbot's gmail address is an edge case I didn't consider. Right now the job is set up to email the developers who have commits that were newly tested by the job (more or less; the logic is somewhat fuzzy but automatic from the plugin). The buildbot is the one who automatically creates the commit that updates Stan's develop branch to point to the latest Math develop that passes the tests, and so it gets an email when that job finishes :P
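
Re: 1, once I have a real address, adding the list on top of the per-committer notifications would be something like the following, assuming the Email Extension plugin's emailext step (the 'to' address below is just a placeholder, not the real group):

// Sketch: mail a subscribable list in addition to the developers whose
// commits were just tested. The list address below is a placeholder.
emailext(
    to: 'stan-buildbot-list@example.org',                        // placeholder
    recipientProviders: [[$class: 'DevelopersRecipientProvider']],
    subject: "${currentBuild.currentResult}: Job '${env.JOB_NAME} [${env.BUILD_NUMBER}]'",
    body: "Check console output at ${env.BUILD_URL}"
)

That would sit at the end of the pipeline (a finally block, or a declarative post section) so it fires whether the build passes or fails.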