Jenkins Updates, Issues & Requests

Hey, I’m creating this thread so anyone can track Jenkins-related issues or requests, post any errors they encounter, and get a quick response.

Work in progress:

  • Investigate why CmdStan Performance Tests (master) are failing. They started failing with hepatitisMe swinging 45% up/down each run. I’ve investigated this with no success and Sean also tried to help; we assume a change in the scripts caused it.
  • Docs (devops)

Pending:

  • Monorepo Jenkins CI

Stuck:

  • Stan unit/integration tests on all OSes. Stuck on the Windows thread.
  • Make use of the new Windows GPU machine in Jenkins math #1528
  • Known bug in the Custom Performance Job that won’t show Windows results.
  • Jenkins EC2 Linux on-demand instances are getting stuck randomly at some point (related to #156; it will be fixed in the next plugin update).

Update for the plugin has been released. I’ve updated it and rebooted Jenkins.

Does this also cover Travis problems? I hope I’m not hijacking, but I’ve been hitting some random failures here over the past few days: https://github.com/stan-dev/stan/pull/2865

Thanks @serban-nicusor!

One thing that also needs some investigation is the occasional failures of the CmdStan Windows interface tests. It’s not a huge problem, since CmdStan tests are fast and we can restart, but these false failures are annoying.

They are related to the TBB. What I know of the issue:

  • we are unable to set the path to the TBB library on Jenkins
  • we resort to copying tbb.dll to the test model folders to circumvent this issue
  • because we run tests in parallel there are issues with multiple threads trying to move tbb.dll and then it fails in some cases

Example: https://jenkins.mc-stan.org/blue/organizations/jenkins/CmdStan/detail/develop/500/pipeline

The cleanest solution would be to properly add the TBB path to the PATH environment variable. The alternative would be to reduce the parallelism in these tests to lower the chances of this happening, though it still could.

If the PATH is properly set you should not see prints like these:

Intel TBB is not in PATH.

'stan/lib/stan_math/lib/tbb/tbb.dll' -> 'src/test/test-models/tbb.dll'
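For what it’s worth, the PATH approach can be sketched like this, assuming the repository layout shown in the log message above:

```shell
#!/bin/sh
# Hypothetical sketch: prepend the TBB directory to PATH so Windows can
# locate tbb.dll at runtime, instead of copying it next to each test
# model. The relative path is taken from the log message above.
TBB_DIR="$(pwd)/stan/lib/stan_math/lib/tbb"
PATH="$TBB_DIR:$PATH"
export PATH
```

With PATH set like this before any test stage runs, the parallel jobs never need to race on moving tbb.dll into the model folders.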

@mcol Yes it does!
I’ve created a PR that fixes the apt-get error. Please see #2868

@rok_cesnovar I think to fix that we just need to implement withEnv in the CmdStan Jenkinsfile. I’ll create a PR to verify in a bit.


Here’s a different problem which doesn’t seem to be related to the changes I’ve introduced: https://jenkins.mc-stan.org/blue/organizations/jenkins/Math%20Pipeline/detail/PR-1545/3/pipeline

ar: creating lib/sundials_4.1.0/lib/libsundials_nvecserial.a
ar: unable to rename 'lib/sundials_4.1.0/lib/libsundials_nvecserial.a'; reason: Permission denied
make/libraries:72: recipe for target 'lib/sundials_4.1.0/lib/libsundials_nvecserial.a' failed
mingw32-make: *** [lib/sundials_4.1.0/lib/libsundials_nvecserial.a] Error 1

Hey @rok_cesnovar, I’ve created a PR for the CmdStan issue here: #788. Please take a look when you have some time. I’ve tested it over a few runs and it looks stable; still running some more just to be sure it won’t randomly fail :) I’ve seen no “add tbb to path” messages in the logs.

@mcol I am aware of that but couldn’t find the real reason behind it; I’ll keep looking into it.


On-demand Windows EC2 instances are working again now; the plugin has been updated, which fixed some issues.
I’ve re-enabled them now, so please let me know if you encounter any errors/issues with them.
Thanks


Thanks @serban-nicusor

I think Jenkins and the test nodes are mostly running great now, which is very much appreciated. The only thing that pops up now and then is the “Permission denied” error on Windows machines with the sundials library .a files. If we find a fix for that, it should be smooth sailing.

My hunch is that it’s related to the parallel building of these files, with multiple threads trying to build the same file at once. I’m going to try to debug this by first getting it to fail deterministically. Maybe the easiest solution would be to build these files with a single thread first and then run the tests.

Would it be possible to build the libraries only when we need them (I mean, build them only when they change and keep the binaries around)? As an alternative, we can think about ccache… maybe restrict the use of ccache to the libraries by calling make on the libraries first (and only in that call fold in the ccache stuff). That should speed up the build to some degree.
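Combining those ideas, the build could be split into a serial, ccache-backed library phase and a parallel test phase. A rough sketch (the archive path is taken from the log above; the ccache wiring, job counts, and runTests.py invocation are assumptions, not the actual build rules):

```shell
# Illustrative only: build the static libraries with a single make job
# (routed through ccache) so no two jobs ever race creating the same
# .a file, then run the tests in parallel against prebuilt archives.
make CXX="ccache g++" -j1 lib/sundials_4.1.0/lib/libsundials_nvecserial.a
./runTests.py -j8 test/unit
```

Since the parallel phase only links against the already-built archives, the `ar: unable to rename` race should not be reachable.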

A new error is occurring on the Windows Headers and Units phase (see
https://jenkins.mc-stan.org/blue/organizations/jenkins/Math%20Pipeline/detail/PR-1584/11/pipeline/174 and https://jenkins.mc-stan.org/blue/organizations/jenkins/Math%20Pipeline/detail/PR-1590/4/pipeline/174 for example):

cc1plus.exe: out of memory allocating 65536 bytes

cc1plus.exe: out of memory allocating 65536 bytes

cc1plus.exe: out of memory allocating 65536 bytes

<...>

Hey,

related to the permission denied error… well, I’ve looked into it and I’m not sure exactly how to fix it.
@rok_cesnovar Is it happening for you locally when multi-threading? If not, well… it’s something between Jenkins and the Windows machines that I’ll have to find a way to debug. Maybe it’s related to Java limits.

@mcol Thanks for reporting! We had that before, and what fixed it was allocating more memory to the process. If you can, take a look here: it mentions compiling UTF-16, which may not be our case, but better to be sure.
I’ll look into it now and give it more memory, which should fix it.

Edit: Memory increased, please do let me know if this happens again. Thanks!
Edit: I’ve merged the PR for performance-tests, which will print better, easier-to-read results to GitHub. If you encounter any errors, please let me know. Thanks!


Just informing that the memory thing happened again: https://jenkins.mc-stan.org/blue/organizations/jenkins/Math%20Pipeline/detail/PR-1471/49/pipeline/


Thank you @rok_cesnovar for letting me know!
I’ve narrowed it down to swap, so I’ve increased it from 7.5 MB to 25 GB.
I may be wrong, but this should be included in the docs somewhere; I think a local build would also fail with very low swap (7.5 MB is almost nothing).
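For the docs, adding swap on a Linux worker usually looks like this (the size and path here are illustrative, and it needs root, so this is a sketch rather than something to paste blindly):

```shell
# Illustrative only (requires root): create and enable a 25 GB swap file.
sudo fallocate -l 25G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
swapon --show   # verify the new swap is active
```

Note that this does not persist across reboots unless the swap file is also added to /etc/fstab.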


Have these changes happened? The errors seem to occur almost all the time now; for example, https://jenkins.mc-stan.org/blue/organizations/jenkins/Math%20Pipeline/detail/PR-1590/11/pipeline is just the latest of a long series.

Yes, they did.
That error happens only on the EC2 Windows instances, hence the sudden spike in this kind of error.
I saw some PRs running fine after the swap increase, so I’m not sure what happened here; I’ve restarted your job now to see how it goes. If it’s still happening, then swap wasn’t the proper fix. In that case I’ll disable the instances so we avoid further failures until I find a proper fix.
I’m sorry for the inconvenience caused by it.

Edit: The EC2 instances were still an issue after the swap increase; I’ve disabled them for now, until I figure out what’s causing the out-of-memory exception.


@serban-nicusor, could you document how everything’s laid out? I think this page has gotten pretty out of date: https://github.com/stan-dev/stan/wiki/Jenkins

and I don’t know where there’s current documentation about our devops for the project.

For specific requests: in the Math repo, I’d like to consider having Jenkins pass PRs when the contents of the PR do not touch the stan/ or test/ folders, or maybe more specifically, when it doesn’t touch source code. Things like updating the README don’t need to trigger a test… we can move more fluidly that way.

Also, maybe we can have a separate thread discussing how to bring testing time and resources down. We can do it if we leverage the dependencies of the C++ headers properly. We’ve lost some of that over the years, but with the latest efforts to clean up the code base, we should be able to get back to it.


+1

Two minor things on this topic:

  • The Windows Headers & Unit tests take close to 2 hours to run, 1:20–1:50 of which is compilation. I am not sure we can do much there as long as we are stuck with Rtools 3.5 and gcc 4.9.3. The 4.0 release is coming soon (end of February), but as long as we state that we support gcc 4.9.3+, that is what we should be testing.

  • Distribution tests on merge to develop run for 6+ hours on gelman-group-linux. If they run anywhere else (I think it’s the cloud instances), they take 2–3 hours, which is fine. Even the 6-hour run time would not be a problem, as we don’t need to wait for merge-to-develop tests for PRs. The issue is that gelman-group-linux is one of only two Jenkins workers eligible to run the full unit tests with GPU, so it becomes the bottleneck. Maybe we shouldn’t use this instance for distribution tests, at least until we fix and merge https://github.com/stan-dev/math/pull/1528, which would add another GPU Jenkins worker.


@syclik I’ll focus on bringing the documentation up to date. I also need to write a lot about the infrastructure behind Jenkins, since we moved it to AWS and added a Docker container repository.
It does make sense to skip jobs that don’t touch source code; I can implement it based on git diff. I’ll look into it and create PRs, I suppose for all the repos: stan, cmdstan, math, stanc3.
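The git diff check could be sketched like this. The source folders (stan/, test/) come from the request above; the branch names and the throwaway repo setup below are purely illustrative:

```shell
#!/bin/sh
set -e
# Hypothetical sketch: decide whether CI should run based on which files
# a branch changes, skipping tests for docs-only changes. The folder
# names (stan/, test/) and branch names are assumptions for illustration.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m base
git branch -q base                       # stand-in for the PR's target branch
echo update > README.md
git add README.md
git -c user.email=ci@example.com -c user.name=ci commit -q -m "docs only"

# Core of the idea: inspect the changed file list against source folders.
if git diff --name-only base...HEAD | grep -qE '^(stan|test)/'; then
  decision=run-tests
else
  decision=skip-tests
fi
echo "$decision"
```

A README-only change like the one simulated here would print skip-tests; touching anything under stan/ or test/ would flip it to run-tests.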

I would advise keeping the testing time/resources discussion here, so we have it in one place, since it’s related to Jenkins.

Edit: Created a cronjob for the new Windows machine since it keeps running out of space; it will clean the workspace every day :)
Also started working on updating the Jenkins documentation here to include the recent changes, the infrastructure, and disaster recovery.
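The daily cleanup can be sketched like this; the workspace root, schedule, and one-day retention are assumptions, and the runnable part below simulates it in a temp directory:

```shell
#!/bin/sh
set -e
# Hypothetical sketch of the daily cleanup: remove top-level workspace
# directories untouched for more than a day. A real cron entry might be:
#   0 3 * * * find /var/jenkins/workspace -mindepth 1 -maxdepth 1 -mtime +0 -exec rm -rf {} +
root="$(mktemp -d)"                      # stand-in for the workspace root
mkdir "$root/old-job" "$root/fresh-job"
touch -d '2 days ago' "$root/old-job"    # simulate a stale workspace

find "$root" -mindepth 1 -maxdepth 1 -mtime +0 -exec rm -rf {} +
ls "$root"                               # only fresh-job survives
```

Using -mtime on the workspace directories keeps anything a running job touched within the last 24 hours.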


Does anyone know who’s paying for this and how?

I’m getting overdue bill notices forwarded to me from Amazon, but I wasn’t involved in setting up payments. I’ve forwarded them to @jonah but haven’t gotten any feedback.

I think @seantalts and the @SGB are in charge of this now, but Sean’s leaving, so we need to sort out how to maintain it going forward.

P.S. I’m happy to communicate some other way with the SGB than tagging them in Discourse messages—Discourse gripes at me every time I do this.
