Troubleshooting
===============

Common deployment problems and their fixes.

CUDA / RAPIDS version mismatch
------------------------------

* Symptom: ``ImportError: libcudart.so`` or ``RuntimeError: CUDA error`` when
  importing ``cudf``/``cuml``.
* Fix: Verify that your installed RAPIDS packages (``cudf``, ``cuml``,
  ``dask-cuda``) match the CUDA driver version. GENBoostGPU targets the
  ``25.8`` release on CUDA 12.x. Recreate the environment with matching conda
  channels or upgrade the host driver.

Out of memory (OOM)
-------------------

* Lower ``batch_size`` and ``working_set['K']`` to reduce per-iteration memory.
* Enable RAPIDS Memory Manager by exporting ``RMM_POOL_SIZE=12GB`` before
  running scripts, or rely on the default ``LocalCUDACluster`` settings.
* Chunk genotype loading or downsample windows. ``select_tuning_windows`` can
  help prioritise informative regions first.

Pandas / NumPy pinning conflicts
--------------------------------

* Symptom: ``ImportError`` complaining about binary incompatibilities between
  ``pandas`` and ``numpy`` or ``pandas-plink``.
* Fix: Use the version constraints shipped in ``pyproject.toml``
  (``pandas>=2.3`` and ``numpy<2.3``). If building docs on CPU, install
  versions that satisfy those constraints before ``pandas-plink`` to avoid ABI
  mismatches.

Dask connection issues
----------------------

* Symptom: Workers fail to connect or time out when launching multi-GPU runs.
* Fixes:

  - Ensure all nodes share the same CUDA/RAPIDS versions and NCCL libraries.
  - Leave the dashboard disabled (the default), or point
    ``--dashboard-address`` at a free port, for example
    ``--dashboard-address=:8787``.
  - Export ``DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=60s`` for clusters with
    slow startup.
  - When running locally, try ``CUDA_VISIBLE_DEVICES=0`` to force single-GPU
    mode and confirm the pipeline works before scaling out.
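
Several of the problems above come down to mismatched package versions, so it can help to print what is installed before changing anything. The following is a minimal sketch, not part of GENBoostGPU: the helper names (``installed_versions``, ``major_minor``, ``pin_problems``) are hypothetical, and the distribution names in ``PACKAGES`` are assumptions, since RAPIDS wheels are often published with a CUDA suffix such as ``cudf-cu12``. The only constraints it checks are the ``pandas>=2.3`` and ``numpy<2.3`` pins quoted above.

```python
import re
from importlib import metadata

# Assumed distribution names; adjust to match your environment
# (pip-installed RAPIDS packages may carry a suffix such as "-cu12").
PACKAGES = ["cudf", "cuml", "dask-cuda", "pandas", "numpy", "pandas-plink"]


def installed_versions(names):
    """Return {name: version string, or None if the distribution is absent}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def major_minor(version):
    """Parse the leading 'X.Y' of a version string into an (int, int) tuple."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def pin_problems(versions):
    """List violations of pandas>=2.3 and numpy<2.3.

    Only major.minor is compared, so pre-release ordering is ignored.
    """
    problems = []
    pandas_v = versions.get("pandas")
    numpy_v = versions.get("numpy")
    if pandas_v is not None and major_minor(pandas_v) < (2, 3):
        problems.append(f"pandas {pandas_v} is older than 2.3")
    if numpy_v is not None and major_minor(numpy_v) >= (2, 3):
        problems.append(f"numpy {numpy_v} is not below 2.3")
    return problems


if __name__ == "__main__":
    versions = installed_versions(PACKAGES)
    for name, version in versions.items():
        print(f"{name:<14}{version or 'not installed'}")
    for problem in pin_problems(versions):
        print("WARNING:", problem)
```

The script reads package metadata only and never imports ``cudf`` or ``cuml``, so it runs even when the CUDA libraries fail to load. Running it on every node of a multi-GPU cluster and comparing the output is one way to confirm that all workers share the same versions.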