Performance & scaling

GENBoostGPU is designed to scale from a single GPU workstation to multi-GPU servers managed by Dask. This page collects best practices for keeping runs fast and memory-efficient.

Execution modes

Single GPU: If numba.cuda.gpus reports exactly one device, genboostgpu.orchestration.run_windows_with_dask() runs windows serially without spinning up Dask. This mode is ideal for debugging or laptop-scale experiments.
Multi GPU: When more than one GPU is visible, the orchestrator launches a dask_cuda.LocalCUDACluster. Windows are submitted asynchronously and throttled with max_in_flight to balance throughput and memory usage. From v0.2.0 onward, max_in_flight defaults to 2 * num_gpus so the scheduler keeps enough work queued without exhausting memory on 4+ GPU nodes.

Memory management tips

Reduce batch_size and the working_set['K'] parameter for massive windows. Smaller batches lower peak GPU memory at the cost of more iterations.
Use chunked genotype loading if PLINK files do not fit in device memory. Wrap load_genotypes with your own CuPy memmap logic and pass geno_arr as needed windows are processed.
Enable RAPIDS Memory Manager pools. Either set RMM_POOL_SIZE=12GB in the environment or rely on the default rmm_pool_size="12GB" that genboostgpu.orchestration.run_windows_with_dask() passes to Dask.

Pre-allocate CuPy memory pools for repeated runs:

import cupy as cp
cp.cuda.set_allocator(cp.cuda.MemoryPool().malloc)

Prefer Parquet outputs via save=True; they can be reloaded lazily for downstream aggregation without keeping everything in memory.

Distributed execution

Pin workers to GPUs using CUDA_VISIBLE_DEVICES or the --CUDA_VISIBLE_DEVICES flag when launching via dask-scheduler/dask-cuda-worker.
Keep the dashboard disabled on headless clusters (default behaviour) to reduce port conflicts on shared systems.
Set DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING=True to let idle GPUs pull windows from busy workers when jobs vary in size.

Environment checklist

RAPIDS_VERSION should match the cudf/cuml wheels you installed.
CUDA_VISIBLE_DEVICES controls which GPUs are used; set it to a comma-separated list or leave it unset to use all devices.
OPTUNA_STORAGE (e.g., sqlite:///optuna.db) turns on persistent study storage for large sweeps.
GENBOOSTGPU_LOG_LEVEL=INFO (custom environment variable) can be exported to surface more orchestration logs if you hook it inside your scripts.