Troubleshooting

Common deployment problems and their fixes.

CUDA / RAPIDS version mismatch

Symptom: ImportError: libcudart.so or RuntimeError: CUDA error when importing cudf/cuml.
Fix: Verify that the installed RAPIDS packages (cudf, cuml, dask-cuda) all come from the same release and that this release supports the CUDA version provided by the host driver. GENBoostGPU targets the RAPIDS 25.8 release on CUDA 12.x. If the versions disagree, recreate the environment from matching conda channels or upgrade the host driver.
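A quick way to see which RAPIDS versions are actually installed is to read the package metadata, which works even when importing cudf itself fails. The sketch below is only an illustration: the distribution names are assumptions (conda installs usually register cudf, cuml and dask-cuda, while pip wheels often carry a suffix such as -cu12), and the helper names are made up for this example.

```python
from importlib import metadata

# Distribution names are assumptions: conda installs typically register
# "cudf", "cuml" and "dask-cuda"; pip wheels often add a suffix like "-cu12".
PACKAGES = ("cudf", "cuml", "dask-cuda")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def release_of(version):
    """Reduce '25.8.0' or '25.08.00' to the (year, month) pair (25, 8)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def releases_agree(versions):
    """True when every installed package reports the same (year, month) release."""
    releases = {release_of(v) for v in versions.values() if v is not None}
    return len(releases) <= 1


if __name__ == "__main__":
    versions = installed_versions()
    for name, version in versions.items():
        print(f"{name}: {version or 'not installed'}")
    print("releases agree:", releases_agree(versions))
```

If the releases disagree, or a package is reported as not installed, rebuilding the environment is usually quicker than patching individual packages.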

GPU out-of-memory errors

Fixes:
- Lower batch_size and working_set['K'] to reduce per-iteration memory.
- Enable the RAPIDS Memory Manager (RMM) pool by exporting RMM_POOL_SIZE=12GB before running scripts, or rely on the default LocalCUDACluster settings.
- Chunk genotype loading or downsample windows. select_tuning_windows can help prioritise informative regions first.
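Lowering batch_size by hand can be automated with a small retry loop. The sketch below is not part of GENBoostGPU: run_with_backoff and the step callable are hypothetical names, and it assumes that GPU memory exhaustion surfaces as a Python MemoryError, which may not hold for every code path.

```python
def run_with_backoff(step, batch_size, min_batch_size=1):
    """Call step(batch_size), halving batch_size after each MemoryError.

    `step` is a hypothetical callable standing in for one training run.
    Returns (result, batch_size_that_worked).
    """
    while True:
        try:
            return step(batch_size), batch_size
        except MemoryError:
            if batch_size <= min_batch_size:
                raise  # nothing smaller left to try
            batch_size = max(min_batch_size, batch_size // 2)
```

The batch size that finally succeeds is returned alongside the result so it can be recorded and reused as the starting point for later runs.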

pandas / numpy ABI mismatch

Symptom: ImportError complaining about binary incompatibilities between pandas and numpy or pandas-plink.
Fix: Use the version constraints shipped in pyproject.toml (pandas>=2.3 and numpy<2.3). If building docs on CPU, install pandas and numpy versions that satisfy those constraints before installing pandas-plink to avoid ABI mismatches.
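To confirm that an environment satisfies the constraints quoted above without importing the possibly broken packages, the installed versions can be read from package metadata. This is a minimal sketch with hypothetical helper names; it compares only the major and minor components and hard-codes the two bounds from the text rather than parsing pyproject.toml.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Extract the leading (major, minor) integers from a version string."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_constraints(pandas_version, numpy_version):
    """Check pandas>=2.3 and numpy<2.3 on the (major, minor) components."""
    return (
        version_tuple(pandas_version) >= (2, 3)
        and version_tuple(numpy_version) < (2, 3)
    )


if __name__ == "__main__":
    try:
        pandas_v = metadata.version("pandas")
        numpy_v = metadata.version("numpy")
    except metadata.PackageNotFoundError as exc:
        print(f"missing package: {exc}")
    else:
        print(pandas_v, numpy_v, "ok" if meets_constraints(pandas_v, numpy_v) else "mismatch")
```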

Dask cluster connection failures

Symptom: Workers fail to connect or time out when launching multi-GPU runs.
Fixes:
- Ensure all nodes share the same CUDA/RAPIDS versions and NCCL libraries.
- Leave the dashboard disabled (the default), or set --dashboard-address to a free port (for example --dashboard-address=:8787).
- Export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=60s for clusters with slow startup.
- When running locally, try CUDA_VISIBLE_DEVICES=0 to force single-GPU mode and confirm the pipeline works before scaling out.
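The last two fixes can be combined when launching a script from Python. The sketch below builds a child-process environment that pins the run to one GPU and raises the Dask connect timeout. The helper names are hypothetical, and the script path passed to run_single_gpu is whatever entry point you normally use.

```python
import os
import subprocess
import sys


def single_gpu_env(base=None, device="0", connect_timeout="60s"):
    """Return a copy of the environment restricted to one GPU.

    CUDA_VISIBLE_DEVICES limits which GPUs CUDA exposes to the process;
    the DASK_* variable sets Dask's connect timeout via its configuration system.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_VISIBLE_DEVICES"] = device
    env["DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT"] = connect_timeout
    return env


def run_single_gpu(script, *args):
    """Run `script` with the current interpreter in single-GPU mode."""
    completed = subprocess.run(
        [sys.executable, script, *args],
        env=single_gpu_env(),
        check=False,
    )
    return completed.returncode
```

Because the variables are set only for the child process, the parent shell keeps its original GPU visibility, which makes it easy to switch back to a multi-GPU run once the single-GPU pipeline works.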