Performance & scaling
=====================

GENBoostGPU is designed to scale from a single GPU workstation to multi-GPU
servers managed by Dask. This page collects best practices for keeping runs fast
and memory-efficient.

Execution modes
---------------

Single GPU
   If ``numba.cuda.gpus`` reports exactly one device, :func:`genboostgpu.orchestration.run_windows_with_dask`
   runs windows serially without spinning up Dask. This mode is ideal for
   debugging or laptop-scale experiments.
Multi GPU
   When more than one GPU is visible, the orchestrator launches a
   :class:`dask_cuda.LocalCUDACluster`. Windows are submitted asynchronously and
   throttled with ``max_in_flight`` to balance throughput and memory usage. From
   v0.2.0 onward, ``max_in_flight`` defaults to ``2 * num_gpus`` so the scheduler
   keeps enough work queued without exhausting memory on 4+ GPU nodes.
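
The throttling idea can be sketched with the standard library alone. This is
not the orchestrator's actual implementation: ``run_throttled`` is a
hypothetical helper, and a thread pool stands in for the Dask client (with
Dask, ``distributed.wait(..., return_when="FIRST_COMPLETED")`` plays a similar
role).

.. code-block:: python

   from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

   def run_throttled(fn, items, max_in_flight):
       """Run fn over items while keeping at most max_in_flight tasks pending."""
       results = {}
       pending = {}  # future -> index of the item it is processing
       # The thread pool is only a stand-in for a Dask client.
       with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
           for idx, item in enumerate(items):
               if len(pending) >= max_in_flight:
                   # Block until at least one task finishes before submitting more.
                   done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
                   for fut in done:
                       results[pending.pop(fut)] = fut.result()
               pending[pool.submit(fn, item)] = idx
           # Drain whatever is still running.
           for fut, idx in pending.items():
               results[idx] = fut.result()
       # Return results in submission order.
       return [results[i] for i in range(len(results))]

   squares = run_throttled(lambda x: x * x, range(5), max_in_flight=2)

A lower ``max_in_flight`` caps how many windows hold memory at once, while a
higher value keeps more work queued for idle GPUs.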

Memory management tips
----------------------

* Reduce ``batch_size`` and the ``working_set['K']`` parameter for massive
  windows. Smaller batches lower peak GPU memory at the cost of more iterations.
* Use chunked genotype loading if PLINK files do not fit in device memory. Wrap
  ``load_genotypes`` with your own ``numpy.memmap`` logic (CuPy does not provide
  a ``memmap``) and pass ``geno_arr`` chunk by chunk as windows are processed.
* Enable RAPIDS Memory Manager pools. Either set ``RMM_POOL_SIZE=12GB`` in the
  environment or rely on the default ``rmm_pool_size="12GB"`` that
  :func:`genboostgpu.orchestration.run_windows_with_dask` passes to Dask.
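* Reuse a CuPy memory pool across repeated runs so freed blocks are recycled
  instead of being returned to the driver:

  .. code-block:: python

     import cupy as cp

     # Route every CuPy allocation through one shared pool.
     pool = cp.cuda.MemoryPool()
     cp.cuda.set_allocator(pool.malloc)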

* Prefer Parquet outputs via ``save=True``; they can be reloaded lazily for
  downstream aggregation without keeping everything in memory.
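
Chunked loading from the list above might look like the following sketch.
``iter_genotype_chunks`` is a hypothetical helper, not part of GENBoostGPU, and
it assumes the genotypes were already exported to a flat, C-ordered binary
matrix of shape ``(n_variants, n_samples)``; a raw PLINK ``.bed`` file is
bit-packed and would need decoding first.

.. code-block:: python

   import numpy as np

   def iter_genotype_chunks(path, n_samples, n_variants, chunk_variants,
                            dtype=np.float32):
       """Yield (start, stop, block) slices of an on-disk genotype matrix."""
       # The memmap reads pages lazily instead of loading the whole file.
       geno = np.memmap(path, dtype=dtype, mode="r",
                        shape=(n_variants, n_samples))
       for start in range(0, n_variants, chunk_variants):
           stop = min(start + chunk_variants, n_variants)
           # np.array copies only this slice into host RAM; on a GPU node,
           # cupy.asarray(block) would then move it to device memory.
           yield start, stop, np.array(geno[start:stop])

Each yielded block can be handed to the window being processed and released
before the next one is read, so peak memory scales with ``chunk_variants``
rather than with the full matrix.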

Distributed execution
---------------------

* Pin workers to GPUs with the ``CUDA_VISIBLE_DEVICES`` environment variable
  when launching via ``dask-scheduler``/``dask-cuda-worker``, or with the
  ``CUDA_VISIBLE_DEVICES`` argument of :class:`dask_cuda.LocalCUDACluster`.
* Keep the dashboard disabled on headless clusters (default behaviour) to reduce
  port conflicts on shared systems.
* Leave work stealing enabled (``DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING=True``,
  which is the Dask default) so idle GPUs can pull windows from busy workers
  when jobs vary in size.
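
When a run is launched from a Python script rather than a shell, the same
settings can be applied in-process. The helper below is a hypothetical sketch,
not part of GENBoostGPU. It relies on two facts: Dask reads ``DASK_*`` variables
when ``dask.config`` is first imported, and ``CUDA_VISIBLE_DEVICES`` only takes
effect if it is set before the process initialises CUDA.

.. code-block:: python

   import os

   def prepare_dask_env(gpu_ids, environ=os.environ):
       """Set GPU visibility and work stealing before Dask is imported."""
       # Restrict this process to the listed GPUs; child processes inherit
       # the variable unless they override it.
       environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
       # Keep an explicit user setting if one already exists.
       environ.setdefault("DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING", "True")
       return environ

   # Demonstrated on a plain dict so the real environment is left untouched.
   env = prepare_dask_env([0, 1], environ={})

Call the helper on ``os.environ`` at the very top of the script, before any
``import dask``, ``import distributed``, or GPU library import.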

Environment checklist
---------------------

* ``RAPIDS_VERSION`` should match the ``cudf``/``cuml`` wheels you installed.
* ``CUDA_VISIBLE_DEVICES`` controls which GPUs are used; set it to a comma-separated
  list or leave it unset to use all devices.
* ``OPTUNA_STORAGE`` (e.g., ``sqlite:///optuna.db``) turns on persistent study
  storage for large sweeps.
* ``GENBOOSTGPU_LOG_LEVEL=INFO`` is a custom environment variable: export it and
  read it inside your own scripts to surface more orchestration logs.
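
One way to hook the checklist into a script is sketched below.
``configure_from_env`` and the ``"genboostgpu"`` logger name are assumptions
made for illustration; substitute whichever logger your scripts actually use.

.. code-block:: python

   import logging
   import os

   CHECKED_VARS = ("RAPIDS_VERSION", "CUDA_VISIBLE_DEVICES", "OPTUNA_STORAGE")

   def configure_from_env(environ=os.environ):
       """Apply GENBOOSTGPU_LOG_LEVEL and report the checklist variables."""
       level_name = environ.get("GENBOOSTGPU_LOG_LEVEL", "WARNING").upper()
       level = getattr(logging, level_name, logging.WARNING)
       if not isinstance(level, int):
           # Fall back if the name matched a non-level attribute of logging.
           level = logging.WARNING
       # Hypothetical logger name, chosen for this example.
       logger = logging.getLogger("genboostgpu")
       logger.setLevel(level)
       report = {name: environ.get(name, "<unset>") for name in CHECKED_VARS}
       for name, value in report.items():
           logger.info("%s=%s", name, value)
       return level, report

Calling ``configure_from_env()`` once at start-up records which settings a run
used, which helps when comparing results across machines.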