Troubleshooting
Common deployment problems and their fixes.
CUDA / RAPIDS version mismatch
Symptom: ImportError: libcudart.so or RuntimeError: CUDA error when importing cudf / cuml.
Fix: Verify that your installed RAPIDS packages (cudf, cuml, dask-cuda) match the CUDA driver version. GENBoostGPU targets the 25.8 release on CUDA 12.x. Recreate the environment with matching conda channels or upgrade the host driver.
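Before recreating the environment, it can help to see which versions are actually installed. The following is a minimal sketch that reads package metadata only, so it still runs when importing cudf itself fails. The distribution names are an assumption: conda installs normally use the plain names, while pip wheels for CUDA 12 typically carry a -cu12 suffix.

```python
from importlib import metadata

# Distribution names are assumptions: conda installs use the plain names,
# pip wheels for CUDA 12 typically carry a "-cu12" suffix.
CANDIDATES = {
    "cudf": ("cudf", "cudf-cu12"),
    "cuml": ("cuml", "cuml-cu12"),
    "dask-cuda": ("dask-cuda",),
}


def installed_versions(candidates=CANDIDATES):
    """Return {label: version or None} without importing the packages."""
    found = {}
    for label, names in candidates.items():
        found[label] = None
        for name in names:
            try:
                found[label] = metadata.version(name)
                break  # first matching distribution name wins
            except metadata.PackageNotFoundError:
                continue
    return found


if __name__ == "__main__":
    for label, version in installed_versions().items():
        print(f"{label:10s} {version or 'not installed'}")
```

Compare the printed versions with the release GENBoostGPU targets, and with the CUDA version that nvidia-smi reports for the host driver.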
Out of memory (OOM)
Lower batch_size and working_set['K'] to reduce per-iteration memory.
Enable RAPIDS Memory Manager by exporting RMM_POOL_SIZE=12GB before running scripts or rely on the default LocalCUDACluster settings.
Chunk genotype loading or downsample windows. select_tuning_windows can help prioritise informative regions first.
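Chunked loading can be sketched as follows. The iter_chunks helper and the sizes used are hypothetical and not part of GENBoostGPU. The point is only that one block of variants is resident in memory at a time.

```python
from typing import Iterator, Tuple


def iter_chunks(n_items: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, stop) index ranges covering range(n_items)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_items, chunk_size):
        yield start, min(start + chunk_size, n_items)


# Example: 10 variants in chunks of 4 (hypothetical sizes).
for start, stop in iter_chunks(10, 4):
    # A real pipeline would load only the genotype columns start..stop here,
    # process that block, and release it before moving to the next one.
    print(start, stop)
```

A smaller chunk_size lowers peak memory at the cost of more load calls, so it is usually worth halving it step by step until the OOM disappears.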
Pandas / NumPy pinning conflicts
Symptom: ImportError complaining about binary incompatibilities between pandas and numpy or pandas-plink.
Fix: Use the version constraints shipped in pyproject.toml (pandas>=2.3 and numpy<2.3). If building docs on CPU, install those exact versions before pandas-plink to avoid ABI mismatches.
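A quick way to confirm that an environment satisfies these constraints is sketched below. It uses only the standard library and a simplified version comparison that ignores pre-release suffixes. The helper names are illustrative and not part of the project.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Leading numeric components, e.g. '2.3.1' -> (2, 3, 1).

    Simplification: pre-release or local suffixes such as 'rc1' are ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def check_pins(versions: dict) -> list:
    """Return messages for violations of pandas>=2.3 and numpy<2.3."""
    problems = []
    pandas_v = versions.get("pandas")
    numpy_v = versions.get("numpy")
    if pandas_v is None or version_tuple(pandas_v) < (2, 3):
        problems.append(f"pandas {pandas_v} does not satisfy >=2.3")
    if numpy_v is None or version_tuple(numpy_v) >= (2, 3):
        problems.append(f"numpy {numpy_v} does not satisfy <2.3")
    return problems


def installed(names=("pandas", "numpy", "pandas-plink")) -> dict:
    """Map each distribution name to its installed version, or None."""
    result = {}
    for name in names:
        try:
            result[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    found = installed()
    print(found)
    for message in check_pins(found):
        print("PROBLEM:", message)
```

Running this before and after installing pandas-plink shows whether the installer replaced pandas or numpy with an incompatible build.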
Dask connection issues
Symptom: Workers fail to connect or time out when launching multi-GPU runs.
Fixes:
Ensure all nodes share the same CUDA/RAPIDS versions and NCCL libraries.
Keep the dashboard disabled (the default), or set --dashboard-address to a free port (for example --dashboard-address=:8787).
Export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=60s for clusters with slow startup.
When running locally, try CUDA_VISIBLE_DEVICES=0 to force single-GPU mode and confirm the pipeline works before scaling out.
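The two environment overrides above can also be applied from a small Python launcher, which keeps them out of your shell profile. This is a sketch under assumptions: debug_env is a hypothetical helper, and the script name in the usage comment is a placeholder rather than a real GENBoostGPU entry point.

```python
import os


def debug_env(base=None, gpu="0", connect_timeout="60s"):
    """Copy an environment and add the single-GPU and timeout overrides."""
    env = dict(os.environ if base is None else base)
    # Expose a single device so the run cannot fan out across GPUs.
    env["CUDA_VISIBLE_DEVICES"] = gpu
    # Dask maps this variable to the distributed.comm.timeouts.connect setting.
    env["DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT"] = connect_timeout
    return env


# Usage (placeholder script name):
# import subprocess, sys
# subprocess.run([sys.executable, "your_pipeline_script.py"],
#                env=debug_env(), check=True)
```

Passing the variables to a child process matters because Dask reads its configuration from the environment when it is first imported, so changing os.environ after that import has no effect.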