I hadn’t tried that. When X2 is built in the transformed data block, GPU utilization does pick up (see the nvidia-smi output below). That said, this won’t work in my application, since X needs to be built from model parameters.
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.130      Driver Version: 418.130      CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID P100-8Q        On  | 00000000:02:04.0 Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |    819MiB /  8192MiB |     28%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     24541      C   ./normal_glm                                 291MiB |
+-----------------------------------------------------------------------------+
```
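For concreteness, here is a minimal Stan sketch of the pattern I tried, assuming a normal GLM along the lines of the `normal_glm` model above. The actual construction of X2 isn't shown in this thread, so the elementwise square below is just a placeholder.

```stan
data {
  int<lower=0> N;
  int<lower=0> K;
  vector[N] y;
  matrix[N, K] X;
}
transformed data {
  // Built once before sampling, so the matrix stays fixed across iterations.
  // The elementwise square is only a placeholder for whatever X2 really is.
  matrix[N, K] X2 = X .* X;
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 1);
  // With STAN_OPENCL enabled, this GLM likelihood is what runs on the GPU.
  y ~ normal_id_glm(X2, alpha, beta, sigma);
}
```

If X2 had to depend on parameters instead (as in my case), it would have to be rebuilt in transformed parameters or the model block every iteration, which presumably is why the utilization stays low there.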