If anyone is having Jenkins and Travis issues (in the past couple of days or from now on) please leave a message here and I can look into them.
Thanks!
Travis has not worked reliably for rstan and rstanarm for years now. I am planning to just switch them to use the Jenkins in my office.
Yeah, among other things I've seen Travis sometimes take an extra 15 or 20 minutes over "normal" to run tests (thus causing them to time out). I keep breaking them up into smaller and smaller chunks every time I see this…
Just saw the linux node timeout again, which looks like this in the logs:
FATAL: command execution failed
Command close created at
at hudson.remoting.Command.<init>(Command.java:60)
at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1132)
at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1130)
at hudson.remoting.Channel.close(Channel.java:1290)
at hudson.remoting.Channel.close(Channel.java:1272)
at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1137)
Caused: hudson.remoting.Channel$OrderlyShutdown
at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1138)
at hudson.remoting.Channel$1.handle(Channel.java:535)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:83)
Caused: java.io.IOException: Backing channel 'gelman-group-linux' is disconnected.
at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:192)
at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:257)
at com.sun.proxy.$Proxy99.isAlive(Unknown Source)
at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1138)
at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1130)
at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:155)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:109)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:736)
at hudson.model.Build$BuildExecution.build(Build.java:206)
at hudson.model.Build$BuildExecution.doRun(Build.java:163)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:496)
at hudson.model.Run.execute(Run.java:1737)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:419)
Build step 'Execute shell' marked build as failure
ERROR: Step 'Scan for compiler warnings' failed: no workspace for Stan - Tests - Integration #513
ERROR: Step 'Publish JUnit test result report' failed: no workspace for Stan - Tests - Integration #513
ERROR: gelman-group-linux is offline; cannot locate JDK8u66
Finished: FAILURE
I don't know. I guess we just have to live with it until someone figures out a fix.
I still can't even find the agent log files… I'm updating the SSH slaves plugin on Jenkins. Mind if I try Oracle's JRE just to try things that shouldn't work but might anyway?
OK (this is a complete sentence).
Baha. Okay, just updated it to Oracle JRE 8 (some warning message told me 9 was not advisable yet) and updated the SSH slave plugin. Let's hope that helps…
@mitzimorris wrote to me:
this PR hung: https://github.com/stan-dev/stan/pull/2391
requests to retest didn't work.
She's right; as posted above, the linux node went down again yesterday. I'm not sure why her request to retest didn't work…
@mitzimorris experienced another Jenkins weirdness: running Stan Pull Request - Upstream - CmdStan on the same machine as Stan Pull Request - Tests - Unit seems to cause the former test to fail with errors like this:
[ RUN ] CmdStan.optimize_newton
unknown file: Failure
C++ exception with description "bad lexical cast: source type value could not be interpreted as target" thrown in the test body.
[ FAILED ] CmdStan.variational_meanfield (174 ms)
[ RUN ] CmdStan.variational_fullrank
unknown file: Failure
C++ exception with description "bad lexical cast: source type value could not be interpreted as target" thrown in the test body.
[ FAILED ] CmdStan.optimize_newton (24 ms)
[----------] 4 tests from CmdStan (43 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test case ran. (43 ms total)
[ PASSED ] 3 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] CmdStan.optimize_newton
1 FAILED TEST
make: *** [test/interface/optimization_output_test] Error 1
Anyone have any ideas about why this might be? @syclik or @Bob_Carpenter might know Jenkins, or whether the CmdStan tests use some global files or something? I'm not even sure how the jobs ran at the same time, given that the upstream tests run in a different, non-parallelized phase of Stan Pull Request.
CmdStan shouldn't have a problem.
There are a few places where we could have trouble:
- src/test/utility.hpp: we're running things using popen. Maybe that's failing under the new linux boxes? I don't think there's a difficulty with multiple popens in parallel.
- the boost::lexical_cast exception. We can try to trace where that's happening. There are a few places where we use lexical_cast.

To summarize the current status, here are things that I think have been causing flakiness:

1. the pull request trigger mechanism (the "GitHub Pull Request Builder" plugin) misfiring
2. nodes going offline or disconnecting mid-build
3. transient failures in external services, like GitHub going down
4. jobs interfering with each other when they share a machine or workspace

I think the vast majority have been due to #2, and pipelines are supposed to give us better job isolation (dealing with #4), better robustness in the face of node failure (#2), and tools to add retries etc. to help deal with things like #3. I think the mechanism behind #1 is a little better in pipeline land as well, since it gets its parameters from a plugin with commercial support that seems a little more robust than the old "GitHub Pull Request Builder" plugin.
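To make that concrete, here's a rough scripted-pipeline sketch of the primitives involved; the node labels, test commands, and retry counts are illustrative placeholders, not our actual Jenkinsfile:

```groovy
// Hedged sketch: parallel branches for isolation, retry for transient failures.
parallel(
    'unit tests': {
        node('linux') {                  // each parallel branch gets its own workspace
            checkout scm
            retry(3) {                   // re-run a flaky step instead of failing the build
                sh './runTests.py src/test/unit'
            }
        }
    },
    'integration tests': {
        node('linux') {
            checkout scm
            retry(3) {
                sh './runTests.py src/test/integration'
            }
        }
    }
)
```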
Daniel, what kind of stuff could we simplify or coalesce to add robustness or save time, respectively?
I don't think my experimentation with pipelines has been affecting the other jobs, other than that they are also pull requests being tested and thus add testing load.
Quotes from @syclik in an email thread:
Bob, regarding more robust alternatives. One thing we can do is simplify and just have things take more time.
Simple is good. It depends how much more time we're talking about; it's already very time consuming.
We broke things up into smaller chunks so that we could get finer-grained information out about the failure, but if we're willing to give that up, it makes life easier.
We need to get as much information as the pull requester is going to need to debug.
There are things we canāt always control like GitHub going down.
Understood.
If we wanted to simplify, we could have just one project to test Math, one project to test Stan. I think we'd still want to test that project over a number of different configurations, but it'd be one project (see the sketch after the pros and cons below). That would allow us to easily run multiple pull requests in parallel. Right now, Math is tested across something like 6 different projects.
Pros of multiple projects:
- finer-grained information about the failure for the pull requester

Cons of multiple projects:
- harder to run multiple pull requests in parallel

Consolidating would also hit GitHub less often, so we might have fewer issues with their downtime.
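For what it's worth, a single consolidated job could still fan out over configurations with something like the sketch below (the configuration labels and test command are invented for illustration):

```groovy
// Hypothetical single-project job testing several configurations in parallel.
def configs = ['linux-gcc', 'mac-clang']     // made-up configuration axis
def branches = [:]
for (c in configs) {
    def label = c                            // capture the loop variable for the closure
    branches[label] = {
        node(label) {
            checkout scm                     // one project, checked out per configuration
            sh './runTests.py src/test'
        }
    }
}
parallel branches
```

That would keep the per-configuration reporting while leaving just one project per repo to maintain.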
I think the pipelines will get the best of both worlds here, except that right now they're set up to be super parallel and I've done it in a lazy way that involves checking out the git repo on each machine. I can look into changing that to use a new stashing and unstashing feature to spread the git repo across parallelized nodes, which might legit give us the best of both. The parallelization progress bar visualization is a little messed up right now, but I suspect that will get better in future versions, and failures and output are still clearly visible on a per-stage basis (see this and this general stage view for some examples).
I'll look into the stashing thing!
Stashing seems like it might be a decent solution! Though it's breaking something weird right now, I think I will eventually be able to get that settled, and then we can talk to GitHub just once at the beginning of the build (and wrap that in a retry block).
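The shape I have in mind is roughly this (a sketch, assuming the stash step's default behavior of including the whole workspace):

```groovy
// Check out once, under retry, then stash so later nodes skip GitHub entirely.
node('linux') {
    retry(3) {                   // ride out transient GitHub outages
        checkout scm
    }
    stash name: 'sources'        // includes the whole workspace by default
}
```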
Another issue I just encountered is that the Stan src/test/performance tests don't work on the linux node (due to linux or g++, I'm not sure). I knew this already, but didn't realize until today that Math's Upstream - Stan tests also run the performance tests. This resulted in error messages that look like this:
src/test/performance/logistic_test.cpp:111
Value of: first_run[0]
Actual: -66.1493
Expected: -65.7658
lp__: index 0
(from here, but this link will eventually stop working).
What do you mean by "stashing"? (It sounds great no matter what it is.)
Let's split this off into its own thread and fix the problem for good!
Right now it's hardware- and compiler-dependent, which makes it not a good test at all. I've mentioned before: the purpose of the test has really expanded from just timing to a crude integration test (due to real bugs that were introduced, and this was the easiest thing to adapt to prevent future bugs).
Stashing is basically just asking Jenkins to tar up the working directory (or some subset of it) and then unstash it on new nodes on demand; Jenkins makes that pretty easy and hides the inter-machine communication aspects. You can see some light doc here.
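In pipeline code it ends up looking something like this (the stash name, node labels, and make target are made up for illustration):

```groovy
// One checkout, then spread the workspace to parallel nodes via stash/unstash.
node('linux') {
    checkout scm                 // talk to GitHub once
    stash name: 'repo'           // tars up the workspace for reuse elsewhere
}
parallel(
    'mac tests': {
        node('mac') {
            unstash 'repo'       // copies the stashed workspace onto this node
            sh 'make test-headers'
        }
    },
    'linux tests': {
        node('linux') {
            unstash 'repo'
            sh 'make test-headers'
        }
    }
)
```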
It seems to be working now! It takes ~3 minutes to unstash the first time on a new machine (full Stan + Math repos), but it only copies over the network once, so the 2nd time takes only a few seconds. I think this is worth it since it means we talk to GitHub way less, and that whole checkout process for both repos could take a minute or two on its own anyway.
Here's the build for the PR with all the bells and whistles hooked up in the last two builds: http://d1m1s1b1.stat.columbia.edu:8080/job/Stan%20Pipeline/view/change-requests/job/PR-2414/
Regarding email notification, we had set up a Google Group called stan-buildbot so the notification went to a list that people could subscribe to. I don't think it should just go to the one email address. (I'm guessing you haven't even seen these emails?)
---------- Forwarded message ----------
From: ...@gmail.com
Date: Fri, Oct 13, 2017 at 7:07 AM
Subject: [StanJenkins] SUCCESSFUL: Job 'Stan Pipeline/PR-2414 [29]'
To: ...@gmail.com

SUCCESSFUL: Job 'Stan Pipeline/PR-2414 [29]': Check console output at http://d1m1s1b1.stat.columbia.edu:8080/job/Stan%20Pipeline/job/PR-2414/29/
That address is used by the job that updates Stan's develop branch to point to the latest Math develop that passes the tests, and so it gets an email when that job finishes :P
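If we want the pipeline builds to notify the list rather than one inbox, something like this sketch would do it (the list address and test command are my assumptions, not the current setup):

```groovy
// Hypothetical notification wiring: mail the result to the group list.
node('linux') {
    try {
        checkout scm
        sh './runTests.py src/test/unit'
        currentBuild.result = 'SUCCESS'
    } catch (e) {
        currentBuild.result = 'FAILURE'              // record the failure before notifying
        throw e
    } finally {
        mail to: 'stan-buildbot@googlegroups.com',   // assumed list address
             subject: "${currentBuild.result}: Job '${env.JOB_NAME} [${env.BUILD_NUMBER}]'",
             body: "Check console output at ${env.BUILD_URL}"
    }
}
```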