Troubleshooting
===============

Common deployment problems and their fixes.

CUDA / RAPIDS version mismatch
------------------------------

* Symptom: ``ImportError: libcudart.so`` or ``RuntimeError: CUDA error`` when
  importing ``cudf``/``cuml``.
* Fix: Verify that your installed RAPIDS packages (``cudf``, ``cuml``,
  ``dask-cuda``) match the CUDA driver version. GENBoostGPU targets the
  ``25.8`` release on CUDA 12.x. Recreate the environment with matching conda
  channels or upgrade the host driver.

Out of memory (OOM)
-------------------

* Lower ``batch_size`` and ``working_set['K']`` to reduce per-iteration memory.
* Enable RAPIDS Memory Manager by exporting ``RMM_POOL_SIZE=12GB`` before running
  scripts or rely on the default ``LocalCUDACluster`` settings.
* Chunk genotype loading or downsample windows. ``select_tuning_windows`` can
  help prioritise informative regions first.

Pandas / NumPy pinning conflicts
--------------------------------

* Symptom: ``ImportError`` complaining about binary incompatibilities between
  ``pandas`` and ``numpy`` or ``pandas-plink``.
* Fix: Use the version constraints shipped in ``pyproject.toml`` (``pandas>=2.3``
  and ``numpy<2.3``). If building docs on CPU, install those exact versions before
  ``pandas-plink`` to avoid ABI mismatches.

Dask connection issues
----------------------

* Symptom: Workers fail to connect or time out when launching multi-GPU runs.
* Fixes:

  - Ensure all nodes share the same CUDA/RAPIDS versions and NCCL libraries.
  - Disable the dashboard (default) or set ``--dashboard-address=:8787`` to a free
    port.
  - Export ``DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=60s`` for clusters with
    slow startup.
  - When running locally, try ``CUDA_VISIBLE_DEVICES=0`` to force single-GPU mode
    and confirm the pipeline works before scaling out.