Performance & scaling
GENBoostGPU is designed to scale from a single GPU workstation to multi-GPU servers managed by Dask. This page collects best practices for keeping runs fast and memory-efficient.
Execution modes
- Single GPU
  If numba.cuda.gpus reports exactly one device, genboostgpu.orchestration.run_windows_with_dask() runs windows serially without spinning up Dask. This mode is ideal for debugging or laptop-scale experiments.
- Multi GPU
  When more than one GPU is visible, the orchestrator launches a dask_cuda.LocalCUDACluster. Windows are submitted asynchronously and throttled with max_in_flight to balance throughput and memory usage. From v0.2.0 onward, max_in_flight defaults to 2 * num_gpus so the scheduler keeps enough work queued without exhausting memory on 4+ GPU nodes.
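The throttling idea behind max_in_flight can be sketched with the standard library standing in for Dask. This is a minimal illustration under stated assumptions, not GENBoostGPU's actual implementation: run_throttled, process, and num_workers are hypothetical names, and a thread pool replaces the LocalCUDACluster.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_throttled(windows, process, num_workers, max_in_flight=None):
    """Apply process() to each window with a bounded number of pending tasks."""
    if max_in_flight is None:
        # Assumption: mirror the documented default of 2 * num_gpus.
        max_in_flight = 2 * num_workers
    results = {}   # window index -> result
    pending = {}   # future -> window index
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for index, window in enumerate(windows):
            # Wait for a free slot before submitting more work.
            while len(pending) >= max_in_flight:
                done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
                for future in done:
                    results[pending.pop(future)] = future.result()
            pending[pool.submit(process, window)] = index
        # Drain whatever is still running after the last submission.
        for future, index in pending.items():
            results[index] = future.result()
    # Return results in the original window order.
    return [results[i] for i in range(len(results))]
```

Lowering max_in_flight in a scheme like this caps how many windows hold memory at once, at the price of workers occasionally sitting idle while they wait for new submissions.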
Memory management tips
- Reduce batch_size and the working_set['K'] parameter for massive windows. Smaller batches lower peak GPU memory at the cost of more iterations.
- Use chunked genotype loading if PLINK files do not fit in device memory. Wrap load_genotypes with your own memmap logic that feeds CuPy arrays, and pass geno_arr as each window is processed.
- Enable RAPIDS Memory Manager pools. Either set RMM_POOL_SIZE=12GB in the environment or rely on the default rmm_pool_size="12GB" that genboostgpu.orchestration.run_windows_with_dask() passes to Dask.
- Pre-allocate CuPy memory pools for repeated runs:

      import cupy as cp
      # Route CuPy allocations through a dedicated pool so freed blocks are reused.
      cp.cuda.set_allocator(cp.cuda.MemoryPool().malloc)

- Prefer Parquet outputs via save=True; they can be reloaded lazily for downstream aggregation without keeping everything in memory.
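To make the batch-size and chunked-loading tips concrete, the sketch below estimates how many variants fit in a memory budget and returns the matching index ranges. plan_chunks and its parameters are hypothetical helper names, not part of GENBoostGPU. The estimate assumes a dense samples-by-variants array (4 bytes per value by default) and ignores the working memory the model itself needs.

```python
def plan_chunks(n_samples, n_variants, budget_bytes, itemsize=4):
    """Return (start, stop) variant ranges whose dense block fits budget_bytes."""
    if n_samples <= 0 or itemsize <= 0:
        raise ValueError("n_samples and itemsize must be positive")
    bytes_per_variant = n_samples * itemsize  # one column of the genotype matrix
    if bytes_per_variant > budget_bytes:
        raise ValueError("budget too small for even a single variant column")
    chunk = budget_bytes // bytes_per_variant  # variants per chunk
    return [(start, min(start + chunk, n_variants))
            for start in range(0, n_variants, chunk)]


# 1,000 samples x 4 bytes = 4,000 bytes per variant, so 12,000 bytes holds 3.
print(plan_chunks(1000, 10, 12_000))
# -> [(0, 3), (3, 6), (6, 9), (9, 10)]
```

Each (start, stop) pair could then slice a host-side memory map, with only that slice copied to the device, so peak GPU memory stays near the chosen budget.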
Distributed execution
- Pin workers to GPUs using CUDA_VISIBLE_DEVICES or the --CUDA_VISIBLE_DEVICES flag when launching via dask-scheduler / dask-cuda-worker.
- Keep the dashboard disabled on headless clusters (default behaviour) to reduce port conflicts on shared systems.
- Set DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING=True to let idle GPUs pull windows from busy workers when jobs vary in size.
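When scheduler and workers are launched from a script, the two environment variables above can be prepared per process. The helpers below are a hypothetical sketch (scheduler_env and worker_env are illustrative names, not GENBoostGPU functions). Work stealing is a scheduler-side setting, so the sketch puts that variable in the scheduler's environment only.

```python
import os


def scheduler_env(work_stealing=True, base_env=None):
    """Environment for the scheduler process with work stealing toggled."""
    env = dict(os.environ if base_env is None else base_env)
    # Dask reads DASK_-prefixed variables as configuration overrides.
    env["DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING"] = str(bool(work_stealing))
    return env


def worker_env(gpu_ids, base_env=None):
    """Environment for a worker process restricted to the given GPU indices."""
    env = dict(os.environ if base_env is None else base_env)
    # Comma-separated list of the devices this process is allowed to see.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    return env
```

These dictionaries can be passed as the env argument of subprocess.Popen when starting dask-scheduler and dask-cuda-worker, which keeps the pinning explicit without changing the parent shell's environment.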
Environment checklist
- RAPIDS_VERSION should match the cudf / cuml wheels you installed.
- CUDA_VISIBLE_DEVICES controls which GPUs are used; set it to a comma-separated list or leave it unset to use all devices.
- OPTUNA_STORAGE (e.g., sqlite:///optuna.db) turns on persistent study storage for large sweeps.
- GENBOOSTGPU_LOG_LEVEL=INFO (custom environment variable) can be exported to surface more orchestration logs if you hook it inside your scripts.
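Because GENBOOSTGPU_LOG_LEVEL only takes effect if your script reads it, one way to hook it up is sketched below. This assumes the package logs under a logger named "genboostgpu", which the text above does not confirm. configure_logging is a hypothetical helper.

```python
import logging
import os


def configure_logging(env=None):
    """Set a logger's level from GENBOOSTGPU_LOG_LEVEL, defaulting to WARNING."""
    env = os.environ if env is None else env
    name = str(env.get("GENBOOSTGPU_LOG_LEVEL", "WARNING")).upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        # Unknown level names fall back to WARNING instead of raising.
        level = logging.WARNING
    # Assumption: "genboostgpu" is the logger name used by the package.
    logger = logging.getLogger("genboostgpu")
    logger.setLevel(level)
    return logger
```

Calling configure_logging() once at the top of a driver script, followed by logging.basicConfig() to attach a handler, is enough for exported values such as INFO or DEBUG to take effect.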