GPU CI testing

¯\_(ツ)_/¯

Can you chmod as root to give everyone access to libopencl.so? I think that would do it

It was not installed systemwide, but is now. It is technically named

/usr/lib/x86_64-linux-gnu/libOpenCL.so.1
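
For anyone hitting a similar permissions problem, here is a minimal Python sketch that checks whether a shared library is readable by every user. The path is the one quoted above; the helper name `is_world_readable` is made up for illustration and is not part of any Stan or Jenkins tooling.

```python
import os
import stat


def is_world_readable(path):
    """Return True if the 'other' read bit is set on the file at `path`."""
    # os.stat follows symlinks, so a symlinked .so reports its target's mode.
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IROTH)


if __name__ == "__main__":
    # Path taken from the post above; adjust for your machine.
    lib = "/usr/lib/x86_64-linux-gnu/libOpenCL.so.1"
    if os.path.exists(lib):
        print(lib, "world-readable:", is_world_readable(lib))
    else:
        print(lib, "not found")
```

This only looks at the file's own mode bits. A user also needs execute (search) permission on every parent directory, which this sketch does not check.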

Has anything changed with sundials downstream?

The current GPU branch is failing because the Windows machine cannot find some sundials functions it needs. Searching for ‘N_VMake_Serial’ at the link below will show what I’m talking about.

http://d1m1s1b1.stat.columbia.edu:8080/job/Stan/job/downstream%20tests/421/execution/node/89/log/

Not sure how to resolve this, so any help would be very much appreciated.

Yes, it has. I’ll take a look later.

Is the pull request up to date with develop?

It was as of last night. I’ll merge again rn.

Just merged the latest updates, but they were all for MPI so I’m not sure if that will solve my issue

It looks like Stan needs to be updated. Will get on that.

Ty!

Jenkins seems to be throwing an error because my email changed.

Not sending mail to user [my_email_address] with no permission to view Math Pipeline » PR-885 #11
Sending email to: [my_email_address]

remote file operation failed: /Users/Shared/Jenkins/gelman-group-mac/workspace/Math_Pipeline_PR-885-44NXYESDMGOQXZOKNHAZLVBDHULVNPUBLA57ANR6E66GEWV6FDHQ at hudson.remoting.Channel@3b2b5bda:gelman-group-mac: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on gelman-group-mac failed. The channel is closing down or has closed down

Does anyone have access to Jenkins today and can update my email address to be recognized? I can dm you my email to add.

I don’t think that build is failing because your email changed - emailing happens after a build fails. I have a PR to print messages surrounding any email failures so that people aren’t confused, but develop is broken. Also I don’t think there’s a place to like, add a person’s email to Jenkins…? I guess somehow your github email and what Jenkins thinks is your github email got out of sync, but I don’t think there’s a place where you can set that…

It looks like the Mac stopped responding immediately after starting the GPU tests?

Though the compilation worked fine. So I suspect that the process segfaulted or something?

If you Ctrl+F for

[Unit with GPU] make -j16 test/unit/math/rev/scal/fun/trigamma_test test/unit/math/rev/scal/fun/trunc_test 

at the link below

http://d1m1s1b1.stat.columbia.edu:8080/blue/rest/organizations/jenkins/pipelines/Math%20Pipeline/branches/PR-885/runs/11/log/?start=0

it looks like it fails right below where it tries to run deleteDir. I’m not really sure why. Could it be a permissions error somehow? I can try to change my email back to my old one (tbh I don’t even remember changing it, but I must have).

Error I’m seeing is

[Pipeline] [Unit with GPU] sh
[Pipeline] [Unit with GPU] junit
Post stage
[Pipeline] [Unit with GPU] retry
[Pipeline] [Unit with GPU] {
[Pipeline] [Unit with GPU] deleteDir
[Pipeline] [Unit with GPU] }
[Unit with GPU] ERROR: Execution failed
[Unit with GPU] java.io.EOFException

I think that ‘sh’ is running ./runTests.py on the GPU stuff and then immediately failing; see these views:

I really thought people would look at the first error instead of the last more than they have been… lol
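
To make the "first error, not the last" point concrete, here is a rough Python sketch that scans a saved console log and reports the first line that looks like an error. The regular expression is my own guess at what counts as an error marker, not anything Jenkins defines, and the sample lines are copied from the excerpt quoted earlier in the thread.

```python
import re

# Hypothetical pattern: treats "ERROR", "error:" and "...Exception" as error markers.
ERROR_PATTERN = re.compile(r"\bERROR\b|\berror:|Exception\b")


def first_error(lines):
    """Return (line_number, text) for the first line matching ERROR_PATTERN, or None."""
    for number, text in enumerate(lines, start=1):
        if ERROR_PATTERN.search(text):
            return number, text.rstrip("\n")
    return None


if __name__ == "__main__":
    # Sample lines copied from the log excerpt in this thread.
    sample = [
        "[Pipeline] [Unit with GPU] sh",
        "[Pipeline] [Unit with GPU] junit",
        "Post stage",
        "[Pipeline] [Unit with GPU] retry",
        "[Pipeline] [Unit with GPU] {",
        "[Pipeline] [Unit with GPU] deleteDir",
        "[Pipeline] [Unit with GPU] }",
        "[Unit with GPU] ERROR: Execution failed",
        "[Unit with GPU] java.io.EOFException",
    ]
    print(first_error(sample))
```

On a full log you would pass the open file object instead of a list. Compiler warnings that contain the word "error" would also match, so the pattern likely needs tuning.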

Oh interesting. We haven’t touched the Jenkins or makefile for this PR. Do you know what else would cause that?

I suspect that the binary linking and trying to use opencl is segfaulting. I imagine that could be due to permissions or some other device related stuff? I’m not sure. I can try to get you ssh access to try it out on the machine, assuming it works locally?
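
One way to check the segfault theory once someone has shell access is to run a single test binary directly and look at how it exits. Below is a minimal Python sketch, assuming a POSIX machine where `subprocess` reports death by signal as a negative return code. The helper names are hypothetical.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a readable description (POSIX semantics)."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as the negated signal number.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-returncode} ({name})"
    return f"exited with status {returncode}"


def run_and_describe(command):
    """Run `command` (a list of arguments) and describe how it ended."""
    result = subprocess.run(command)
    return describe_exit(result.returncode)
```

For example, `run_and_describe(["test/unit/math/rev/scal/fun/trigamma_test"])` would report SIGSEGV if the binary crashes that way. That path assumes the built executable sits at the same location as the make target named in the log.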

Yeah everything was running fine locally. If you can get me an ssh that would be great!

Odd, I just merged master into the gpu branch and now it’s failing on the MPI tests?

http://d1m1s1b1.stat.columbia.edu:8080/blue/organizations/jenkins/Math%20Pipeline/detail/PR-885/12/pipeline/79

There’s a bunch of warnings, but this is the first error I’m seeing


error: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/install_name_tool: more than one input file specified (lib/boost_1.66.0/stage/lib/libboost_mpi.dylib and Pipeline/branches/PR-885/workspace/lib/boost_1.66.0/stage/lib)

Usage: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/install_name_tool [-change old new] ... [-rpath old new] ... [-add_rpath new] ... [-delete_rpath old] ... [-id name] input

make: *** [lib/boost_1.66.0/stage/lib/libboost_mpi.dylib] Error 1

Looking over the changes we don’t mess with anything in the makefile so I think this is a Jenkins thing. If you can pm me the ssh stuff I can go take a look at this.

About 24 hours ago we merged some updates I made to the MPI make stuff into develop… and looking at the error you are getting, there is actually a mistake in the makefiles. It is triggered by the Jenkins setup because there is a space in the absolute path. I will fix that tomorrow morning… or you can just do it on your branch. What you need to do is add quotation marks around $(BOOST_LIB_ABS) on lines 76 and 82, so that it reads "$(BOOST_LIB_ABS)". Sorry for that.
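
To illustrate why the missing quotes matter, here is a small Python sketch that uses `shlex.split`, which follows POSIX shell word-splitting rules, to show how a path containing a space becomes two arguments. The directory and the `-add_rpath` invocation are made up for illustration (modelled on the "Math Pipeline" workspace name); the actual makefile command may differ.

```python
import shlex

# Hypothetical path with a space, modelled on the "Math Pipeline" workspace directory.
lib_dir = "/workspace/Math Pipeline/lib/boost_1.66.0/stage/lib"

# Illustrative command lines; not the actual makefile recipe.
unquoted = f"install_name_tool -add_rpath {lib_dir} libboost_mpi.dylib"
quoted = f'install_name_tool -add_rpath "{lib_dir}" libboost_mpi.dylib'

# Split each command line the way a POSIX shell would.
unquoted_args = shlex.split(unquoted)
quoted_args = shlex.split(quoted)

print(len(unquoted_args), unquoted_args)
print(len(quoted_args), quoted_args)
```

Without the quotes, the path splits at the space and the tool receives an extra argument beginning with `Pipeline/`, which matches the "more than one input file specified" error above. With the quotes, the path stays a single argument.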
